Data-driven machine learning models for the quick and accurate prediction of Tg and Td of OLED materials

Organic light-emitting-diode (OLED) materials have found a wide range of applications. However, the further development and commercialization of OLEDs require higher-quality OLED materials, in particular materials with high thermal stability as characterized by the glass transition temperature (Tg) and the decomposition temperature (Td). Experimental determination of these two properties generally involves a time-consuming and laborious process. Thus, a quick and accurate prediction tool is highly desirable. Motivated by this challenge, we explored machine learning by constructing a new dataset with more than one thousand samples collected from a wide range of literature, on which ensemble learning models were trained. Models trained with the LightGBM algorithm exhibit the best prediction performance, with MAE, RMSE, and R² values of 17.15 K, 24.63 K, and 0.77 for Tg prediction and 24.91 K, 33.88 K, and 0.78 for Td prediction. The prediction performance and generalization of the machine learning models were further tested on out-of-sample datasets, also with satisfactory results. Experimental verification further demonstrates the reliability and practical potential of the ML-based models. To extend the practical application of the ML-based models, an online prediction platform was constructed, including the optimal prediction models and all the thermal stability data under study, which is freely available at http://oledtppxmpugroup.com. We expect that it will become a useful tool for experimental investigations of Tg and Td, in turn accelerating the design of high-performance OLED materials.


Introduction
Organic light-emitting diodes (OLEDs) have attracted considerable attention in recent years due to their great promise in flat-panel displays, solid-state lighting, and white lighting technologies. [1][2][3] The commercialization of OLEDs requires high-quality OLED devices, in particular devices with long lifetimes. 4 Both intrinsic and extrinsic factors affect the lifetime of OLED devices. 5,6 One of the main extrinsic factors is temperature.
It is well known that the temperature of OLED devices can increase due to Joule heating during operation and exposure to high-temperature external environments. 7 Accordingly, enhancing the thermal stability of OLED materials can improve device performance. Recently, many researchers have turned their attention to OLED materials with high thermal stability. 8 The glass transition temperature (Tg) and decomposition temperature (Td, corresponding to 5% weight loss) are the two most important thermal properties of OLED materials, and they exert a significant influence on the performance of OLED devices. 9 In particular, the Tg of OLED materials is one of the most important factors influencing device stability and lifetime, since OLED devices deteriorate irreversibly when heated above their Tg. 7,10 High Tg and Td values can reduce heat-induced morphology changes, thus enhancing the stability of device performance. 11,12 Experimentally, the Tg and Td values of OLED materials are generally measured by differential scanning calorimetry (DSC) and thermal gravimetric analysis (TGA), respectively.
However, before DSC and TGA measurements, OLED materials need to be purified by column chromatography or sublimation, 13,14 which are complicated and time-consuming. Thus, a quick and accurate method to predict Tg and Td is highly desirable. It is generally accepted that the thermal stability of OLED materials is closely related to their molecular structures. 15 However, the relationship between the molecular structure and thermal stability properties such as Tg and Td is complex and has not been elucidated so far.
Machine learning (ML), as a key technique of artificial intelligence, can map the complex relationships underlying a large amount of data, and it has been successfully applied in the fields of medicinal chemistry, environmental risk assessment, organic synthesis, and materials science. [16][17][18][19][20][21] To the best of our knowledge, only two previous studies have used machine learning to predict the Tg of OLED materials, 22 OLED materials with R² = 0.963, MAE = 0.97 K for test set (not independent test set) by using support vector machines (SVM). 23 Although the two studies appear to achieve high prediction accuracy, this results from the very small amount of data used (fewer than 100 molecules). In fact, the prediction ability is unreliable and unstable, in particular with respect to generalization to unknown compounds, because the performance of ML depends on the dataset size. Unfortunately, no database covering the two important properties Tg and Td has been available so far. However, in the past decade, a significant amount of thermal stability data for OLED materials has been published, in particular for thermally activated delayed fluorescence (TADF) materials, which are purely organic molecules with a potential 100% internal quantum efficiency without the aid of heavy metals. 24 Although these published data are dispersed across different publications, they still provide a possible data source for constructing robust machine learning models.
Motivated by this challenge, we constructed a new dataset containing the experimental Tg data of 1944 small organic molecules and the experimental Td data of 1182 small organic OLED compounds collected from a large body of literature. Based on the new dataset, we utilized an ensemble learning approach, the LightGBM algorithm, rather than the single machine learning methods used in previous works, to build comprehensive models relating molecular structure to the two properties by constructing and combining base learners. The optimal Tg prediction model provides an accurate prediction with RMSE = 24.63 K, MAE = 17.15 K, and R² = 0.77 for the test set. For Td prediction, the optimal model provides an accurate prediction with RMSE = 33.88 K, MAE = 24.91 K, and R² = 0.78 for the test set. In addition, the optimal models accurately predict the Tg and Td of out-of-sample compounds, including recently reported OLED host and guest materials, organic electron-transport materials, and hole-transport materials. We then used the optimized models to predict the Tg and Td of 50 unknown OLED molecules designed by us and selected the compound TPA-2 (with a high Tg and the highest Td) for experimental synthesis and determination of its thermal stability. The experimental verification further confirms the prediction reliability of our ML models. In addition, we developed a website that includes the optimized Tg and Td prediction models together with the new dataset, which is freely available at http://oledtppxmpugroup.com. We expect that the website will serve as a useful tool to help experimental investigators quickly estimate Tg and Td.

Construction of dataset
Unfortunately, there is no existing database that organizes OLED materials and their properties. Currently, the thermal stability data of OLED materials are scattered throughout the literature. Therefore, experimental glass transition temperatures (Tg) for a diverse set of 1944 molecules were collected from a large body of literature using the SciFinder database. These Tg values were measured by DSC. For molecules with multiple recorded entries, an average Tg was used as the output if the variation was less than 40 K. Molecules with a Tg variation larger than 40 K were not included in the dataset. Experimental thermal decomposition temperatures (Td, corresponding to 5% weight loss) for a diverse set of 1182 OLED molecules were also collected from the literature. The Td data for these OLED molecules were measured by TGA. For molecules with multiple Td values, we compared the TGA curves in the literature and took the Td measured from the smoother TGA curve as the final value (Fig. S1). Table S1 lists 13 OLED compounds with multiple recorded entries. As can be seen, the deviation in Tg for compounds reported in different papers is often within 40 K. However, the Td values of the same compounds reported in different papers show a large deviation, often greater than 40 K. This is because the purity of a compound has a significant influence on the experimental value of Td (corresponding to 5% weight loss). It should be noted
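The curation rule for duplicate Tg entries can be illustrated with a short Python sketch. It is not the authors' curation script: the function name, the record layout, and the reading of "variation" as the difference between the largest and smallest reported value are assumptions made for illustration, and the handling of a spread of exactly 40 K (dropped here) is not specified in the text.

```python
from statistics import mean

MAX_SPREAD_K = 40.0  # threshold stated in the text


def aggregate_tg(records, max_spread=MAX_SPREAD_K):
    """Average duplicate Tg entries per compound.

    records: iterable of (compound_id, tg_in_kelvin) pairs (hypothetical layout).
    Returns a dict mapping compound_id to the mean Tg. Compounds whose
    reported values span max_spread or more are left out (assumed reading
    of "variation" as max minus min).
    """
    by_compound = {}
    for compound_id, tg in records:
        by_compound.setdefault(compound_id, []).append(tg)

    curated = {}
    for compound_id, values in by_compound.items():
        if max(values) - min(values) < max_spread:
            curated[compound_id] = mean(values)
    return curated


# Made-up values, for illustration only.
example_records = [
    ("A", 400.0), ("A", 410.0),   # spread 10 K: averaged
    ("B", 350.0), ("B", 400.0),   # spread 50 K: excluded
    ("C", 380.0),                 # single entry: kept as is
]
print(aggregate_tg(example_records))
```

The Td rule described above (choosing the value from the smoother TGA curve) relies on inspecting the published curves and is not captured by this sketch.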

Descriptors and fingerprints
In this work, molecular descriptors and fingerprints were used to characterize the molecular structure. Molecular descriptors and fingerprints were calculated with PaDEL-Descriptor version 2.21. 25 The 1D and 2D descriptors and molecular fingerprints were chosen by taking into consideration their general applicability as well as their computational cost. The PaDEL-Descriptor software is open source and free, and the calculation of 1D and 2D descriptors and molecular fingerprints is simple and fast. This facilitates the further promotion and use of our thermal stability prediction models.

Molecular descriptors
1D molecular descriptors were generated based on molecular formulas and 2D molecular descriptors were generated based on the atom connection table. 1D and 2D molecular descriptors belong to the class of molecular property-based descriptors. Each molecular descriptor represents a certain feature of a molecule, such as topology or weight. As each molecular descriptor only depicts a specific property of a molecule, a combination of a large number of molecular descriptors can provide more information.
Using the information encoded in canonical SMILES (simplified molecular input line entry system), the PaDEL software provides 1444 1D and 2D descriptors. However, not all the descriptors were used for modeling; for example, descriptors that could not be computed for all the compounds were excluded. The remaining 665 parameters were used for model definition (including aromatic atom count, aromatic bond count, atom count, bond count, E-state atom type, extended topochemical atom indices, ring count, topology, topological charge, topological distance matrices, topological polar surface area, XLogP, and weight descriptors).
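One of the filtering criteria mentioned above, discarding descriptors that are not computable for every compound, can be sketched in Python as follows. The table layout, the function name, and the descriptor names are hypothetical, and the text indicates that this was only one of the criteria used to arrive at the final 665 descriptors.

```python
import math


def drop_incomplete_descriptors(table):
    """Keep only descriptors that have a value for every compound.

    table: dict mapping a descriptor name to a list of values, one per
    compound, where a failed calculation is stored as None or NaN
    (an assumed convention, not PaDEL's output format).
    """
    def is_valid(value):
        if value is None:
            return False
        return not (isinstance(value, float) and math.isnan(value))

    return {
        name: values
        for name, values in table.items()
        if all(is_valid(v) for v in values)
    }


# Made-up descriptor table for three compounds.
descriptor_table = {
    "atom_count": [30, 42, 51],
    "xlogp": [5.1, float("nan"), 6.3],   # failed for the second compound
    "ring_count": [4, None, 6],          # failed for the second compound
}
print(list(drop_incomplete_descriptors(descriptor_table)))
```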

Molecular fingerprints
Five types of fingerprints (a total of 2741 parameters) were calculated for this research, including CDK fingerprints (1024 bits), CDK extended fingerprints (1024 bits), E-state fingerprints (79 bits), substructure fingerprints (307 bits), and substructure fingerprints count (307 bits). Molecular fingerprints are a subclass of molecular descriptors that can be obtained without quantum-mechanical calculations. They belong to the class of fragment-based descriptors, 26 and they were used in this study due to their high potential for the high-throughput screening of materials. These fragment-based descriptors are represented as a Boolean array, indicating the existence of the corresponding fragments in the molecule. The descriptions of the molecular fingerprints used in this study are listed in Table S2. CDK fingerprints, CDK extended fingerprints, and E-state fingerprints are a good expression of the molecular backbones.
Substructure fingerprints and substructure fingerprints count provide differentiation for an array of functional groups.
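As an illustration of how several fingerprints can be joined into one input vector, the following Python sketch concatenates fingerprint vectors and checks their lengths against the bit counts given above. The function name and the all-zero placeholder vectors are assumptions; real vectors would come from PaDEL-Descriptor.

```python
# Bit counts as stated in the text.
FINGERPRINT_BITS = {
    "CDK": 1024,
    "CDK extended": 1024,
    "E-state": 79,
    "substructure": 307,
    "substructure count": 307,
}


def joint_fingerprint(fingerprints, names):
    """Concatenate the named fingerprint vectors in the given order."""
    joined = []
    for name in names:
        vector = fingerprints[name]
        if len(vector) != FINGERPRINT_BITS[name]:
            raise ValueError(f"unexpected length for {name}: {len(vector)}")
        joined.extend(vector)
    return joined


# Placeholder all-zero vectors standing in for one molecule.
example = {name: [0] * bits for name, bits in FINGERPRINT_BITS.items()}
all_five = joint_fingerprint(example, list(FINGERPRINT_BITS))
print(len(all_five))  # 1024 + 1024 + 79 + 307 + 307 = 2741
```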

Machine learning algorithms
LightGBM is a recent modification of the gradient boosting (GB) algorithm. 27
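To make the underlying idea concrete, the sketch below implements plain gradient boosting for squared-error regression with one-split trees (stumps) on a single feature. It is a teaching example and not LightGBM itself, which adds, among other things, histogram-based split finding and leaf-wise tree growth. All function names, the toy data, and the hyperparameter values are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature with the lowest
    squared error and return (threshold, left_value, right_value)."""
    candidates = sorted(set(xs))[:-1]  # drop the largest so the right side is non-empty
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals, which are the
    negative gradient of the (halved) squared-error loss."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contributions of all stumps."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


# Toy data: one made-up feature and made-up temperatures in K.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [400.0, 402.0, 398.0, 500.0, 503.0, 497.0]
model = fit_gradient_boosting(toy_x, toy_y)
print(round(predict(model, 2), 1), round(predict(model, 5), 1))
```

In practice each base learner would be a deeper tree over all input descriptors and fingerprints; the stump keeps the additive, residual-fitting structure visible.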

Machine learning models for Tg and Td
For both the Tg and Td datasets, 90% of the data was used for model training and the remaining 10% was used as an independent test set. In order to establish robust machine learning models to predict the thermal stability of OLED materials, 10-fold cross-validation was used to reduce the randomness of sample division and enhance the stability of the obtained machine learning models. Performance was measured with the squared correlation coefficient (R²), the mean absolute error (MAE), and the root mean squared error (RMSE).
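The three performance measures and a 10-fold split can be written out in a few lines of Python. This is a sketch under stated assumptions: R² is computed as the squared Pearson correlation, following the wording "squared correlation coefficient" (many libraries instead report the coefficient of determination, and the text does not say which was used), and the fold generator with its seed is a hypothetical helper that would be applied to the training portion.

```python
import math
import random


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, folds[i]


# Made-up observed and predicted temperatures in K.
observed = [400.0, 450.0, 500.0]
predicted = [410.0, 440.0, 520.0]
print(mae(observed, predicted), rmse(observed, predicted))
```

Each sample appears in exactly one validation fold, so averaging the metrics over the ten folds uses every training sample once for validation.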
Selecting suitable descriptors is crucial for Tg and Td prediction tasks. We started with the choice of molecular fingerprints. The LightGBM algorithm was used to evaluate their prediction performance. A potential challenge exists due to the multifold molecular features involved in the thermal stability of OLED materials, because a single molecular fingerprint does not cover all of these features. However, combining different molecular fingerprints may solve this problem. Table S3 and Table S4 show the training and testing results of different Tg and Td prediction machine learning models with different fingerprints as inputs. Joint fingerprints including CDK fingerprints (1024 bits), CDK extended fingerprints (1024 bits), and substructure fingerprints count (307 bits) show the best performance, implying that the representation of molecular structures by the molecular backbone and functional groups is potentially better for Tg and Td prediction than the use of other fingerprints. Therefore, the three molecular fingerprints (CDK, CDK extended, substructure count, 2355 bits) were combined as an input, denoted SC_2CDK. Table 1 summarizes the Tg and Td prediction results of the LightGBM models. As can be seen, the machine learning model with 1D and 2D molecular descriptors has better Tg prediction performance than the corresponding machine learning model with fingerprints. Therefore, the 1D and 2D molecular descriptors provide more important information relevant for Tg prediction compared with fingerprints. However, information contained in property-based descriptors (molecular descriptors) and fragment-based descriptors (fingerprints) can complement each other. 26
Table 1 shows that the machine learning model with fingerprints has better Td prediction performance than the corresponding machine learning model with 1D and 2D molecular descriptors, indicating that fingerprints (fragment-based descriptors) can provide more important information relevant to Td compared with 1D and 2D molecular descriptors (property-based descriptors). This is because the thermal decomposition of OLED materials often starts at a specific molecular fragment, usually the weak bond in a functional group. Therefore, fragment-based descriptors can provide more important information relevant to Td prediction. The combination of molecular descriptors and SC_2CDK only slightly improves the performance of Td prediction. The best result for  Table S5 and Table S6.
In addition to the LightGBM algorithm, the performance and efficiency of other models including SVM, PLS, LASSO, KRR, kNN, and RF algorithms were examined. A comparison of the predictive powers of these seven machine learning methods was undertaken based on the MAE and RMSE of Tg and Td prediction (where the input was descriptors + SC_2CDK). The MAE and RMSE of the independent test set for the different machine learning methods are shown in Fig. 5. As can be seen, the LightGBM regressor exhibits the lowest MAE and RMSE for Tg and Td prediction. Based on these results, the LightGBM algorithm was selected as the optimal algorithm for thermal stability prediction of OLED materials.

Model application and verification
The goal of machine learning model construction is enabling the use of the model in practical applications. The obtained Tg and Td prediction models based on the LightGBM algorithm were further tested in out-of-sample predictions and experimental verification. Three representative applications are shown herein.

Independent testing for Tg and Td predictions of OLED materials
To verify the effectiveness of the machine learning models, they were further tested in out-of-sample predictions. The optimal models were applied to the prediction of Tg for 40 OLED compounds and the prediction of Td for 40 OLED compounds reported in recent literature. 14, 28-55 These compounds are mainly used in host-guest emissive layer for OLED devices. More detailed information about these compounds can be found in the supporting information (Table S7 and Table S8). Meanwhile, these compounds were However, one compound demonstrated a very large Td prediction error (3CzCNPyz, with an error of 75.00 K).
In order to clarify the reasons for this large prediction error, 3CzCNPyz can be compared with two other compounds that appear in the literature. 50 The TGA curve of 3CzCNPyz is shown in Fig. 6 and the TGA curves of 2Cz2CNPyz and 4CzPyz are shown in Fig. S2. The compounds 2Cz2CNPyz and 4CzPyz have prediction errors of 2.81 K and -1.79 K, much smaller than the prediction error of compound 3CzCNPyz. K. This is in good agreement with the experimental values. Therefore, it is likely that our model is accurate for the Td prediction of 3CzCNPyz. This example demonstrates that the purity of a compound must be high for the accurate measurement of Td. Because the experimental determination of Tg and Td requires high purity OLED compounds and is time-consuming and labor-intensive, the prediction of Tg and Td based on a machine learning approach is much more convenient.

Independent testing of Tg and Td predictions for hole-transport materials and electron-transport materials
Organic electron-transport materials (ETMs) and hole-transport materials (HTMs) are widely used in OLEDs and perovskite solar cells (PSCs), mainly for the electron-transport layer or hole-transport layer of OLED and PSC devices. Because the electron-transport and hole-transport layers should be thermally stable to improve the overall lifetime of devices, both types of material require high thermal stability. Accurate Tg and Td prediction for organic ETMs and HTMs prior to experimental synthesis will be useful for the development of ETMs and HTMs with the expected properties. Therefore, to verify the practicality of this study's models, the models were used for Tg prediction of ETMs and HTMs can be found in Table S9 and Table S10. These compounds were not included in the original dataset.
A plot of predicted vs. experimental Tg values for 40 organic ETMs and HTMs is shown in Fig. 7a

Experimental verification
In order to verify that these models can be used for the Tg and Td prediction of unknown OLED compounds and for the screening of OLED materials with high thermal stability, the models were used to predict the  Table S11. TPA-2 has a narrower bandgap (Egap) than TPA-PZCN (2.08 eV vs. 2.32 eV). The calculated S1 energy of TPA-2 is also lower than that of TPA-PZCN. This suggests that TPA-2 may show a longer emission wavelength than TPA-PZCN in the same solvent. The ΔEST of TPA-2 (0.22 eV) is smaller than that of TPA-PZCN (0.25 eV). The spin-orbit coupling (SOC) between S1 and T1 was also calculated at the T1 geometry. The <S1|Hso|T1> of TPA-2 (0.27 cm⁻¹) is larger than that of TPA-PZCN (0.13 cm⁻¹), indicating that TPA-2 has a good T1→S1 reverse intersystem crossing (RISC) efficiency. TPA-2 also maintains a large oscillator strength (0.1886), which benefits the radiative transition from S1 to S0. On the basis of these calculation results, TPA-2 is a good candidate for a red-TADF material, providing a further reason for its selection for synthesis verification.
The chemical structure and synthetic route of TPA-2 are presented in Scheme 1. Before testing, the compound was purified by column chromatography and temperature-gradient vacuum sublimation. The structure of TPA-2 was characterized by 1H NMR and 13C NMR (Fig. S4, Fig. S5 and Fig. S6). Experimental test results show that TPA-2 can be used as a red-TADF OLED material (Fig. S7 and Table S12). The thermal properties of TPA-2 were investigated by differential scanning calorimetry (DSC) and demonstrating an error of 15 K, while the predicted Td value is 738 K, demonstrating an error of 41 K. As expected, the TPA-2 compound has good thermal stability. These results show that it is feasible to apply our machine learning models to predict the thermal stability of unknown OLED materials. Our machine learning models also have the potential to screen for OLED materials with high thermal stability.
Scheme 1 Chemical structure and synthetic route of TPA-2.

Website for Tg and Td Predictions
Currently, hundreds of articles on OLED materials are published every year. 8 There is a great deal of useful data in the literature, but there is no existing database that organizes OLED materials data. With the aim of storing the thermal stability data of OLED materials and helping experimental scientists to use the models in designing new OLED compounds with the desired Tg and Td, an online tool was developed. The website that shares the available models for the prediction of Tg and Td is accessible at http://oledtppxmpugroup.com. Users can make predictions by inputting canonical SMILES; the outputs include Tg (K) and Td (K). The Tg and Td data in this article are also available on this website. In the future, we will continue updating the dataset and the optimal models on the website in order to predict Tg and Td more accurately. A screenshot of the website homepage is shown in Fig. 9. More details can be found by visiting the website.
Finally, the experimental validation on an OLED material with high thermal stability further confirms the reliability of our models and their application potential in practice. In addition, we constructed a website including all the data and the optimized ML models in order to provide a simple and quick tool for estimating the two important properties of unknown compounds, in turn assisting the design of OLED materials.