Data-driven machine learning models for the quick and accurate prediction of thermal stability properties of OLED materials

Organic light-emitting-diode (OLED) materials have found a wide range of applications. However, the further development and commercialization of OLEDs require higher-quality OLED materials, including materials with high thermal stability. Thermal stability is associated with the glass transition temperature (Tg) and decomposition temperature (Td), but experimental determination of these two important properties generally involves a time-consuming and laborious process. Thus, the development of a quick and accurate prediction tool is highly desirable. Motivated by this challenge, we explored machine learning (ML) by constructing a new dataset with more than one thousand samples collected from a wide range of literature, on which ensemble learning models were trained. Models trained with the LightGBM algorithm exhibited the best prediction performance, where the values of MAE, RMSE, and R were 17.15 K, 24.63 K, and 0.77 for Tg prediction and 24.91 K, 33.88 K, and 0.78 for Td prediction. The prediction performance and the generalization of the machine learning models were further tested on out-of-sample data, which also gave satisfactory results. Experimental validation further demonstrated the reliability and the practical potential of the ML-based model. In order to extend the practical application of the ML-based models, an online prediction platform was constructed. This platform includes the optimal prediction models and all the thermal stability data under study, and it is freely available at http://oledtppxmpugroup.com. We expect that this platform will become a useful tool for experimental investigation of Tg and Td, accelerating the design of OLED materials with desired properties.


Introduction
Organic light-emitting diodes (OLEDs) have attracted considerable attention in recent years due to their great promise in flat-panel displays, solid-state lighting, and white lighting technologies 1-3. The commercialization of OLEDs requires high-quality OLED devices, in particular devices with a long lifetime 4. Both intrinsic and extrinsic factors affect the lifetime of OLED devices 5,6. One of the main extrinsic factors is temperature.
It is well known that the temperature of OLED devices can increase due to Joule heating during operation and exposure to high-temperature external environments 7. Increasing the thermal stability of OLED materials can, however, strengthen the stability of device performance. Therefore, many researchers have turned their attention to OLED materials with high thermal stability in recent years 8.
The glass transition temperature (Tg) and decomposition temperature (Td, corresponding to 5% weight loss) are the two most important thermal properties of OLED materials, and they exert a significant influence on the performance of OLED devices 9. OLED devices deteriorate irreversibly when heated above their Tg 7,10. High Tg and Td values can reduce heat-induced morphology changes, thus enhancing the stability of device performance 11,12. The experimental Tg and Td values of OLED materials are generally measured by differential scanning calorimetry (DSC) and thermal gravimetric analysis (TGA), respectively. However, before DSC and TGA determination, OLED materials need to be purified by column chromatography or sublimation 13,14, which are complicated and time-consuming procedures. Thus, the development of a quick and accurate method to predict Tg and Td is highly desirable.
It is generally acknowledged that the thermal stability of OLED materials is closely related to their molecular structures 15. However, the relationship between molecular structure and thermal stability properties such as Tg and Td is complex and has not been elucidated. Machine learning (ML), a key technique used in artificial intelligence, can map the complex relationships underlying a large amount of data. ML has been successfully applied in the fields of medicinal chemistry, environmental risk assessment, organic synthesis, and materials science 16-21. To the best of our knowledge, only two previous studies have used machine learning methods to predict the Tg of OLED materials 22,23, and there is a significant lack of Td predictions. In 2003, Yin et al. developed a quantitative structure-property relationship (QSPR) model to predict the Tg of 88 OLED molecules with MAE = 17.9 K by using a multilinear regression (MLR) method 22 [...] 23. These two studies appear to make highly accurate predictions based on a small amount of data (fewer than 100 molecules).
However, the prediction ability of the single ML models in the two previous studies is unreliable and unstable, because the generalization ability of ML models to unknown compounds depends on the size of the dataset. Unfortunately, a database including the two important properties of Tg and Td of OLED materials has not yet been constructed. In the past decade, however, a significant amount of thermal stability data for OLED materials has been published. While these published data are dispersed across a wide range of literature, they still provide a possible data source for constructing a robust machine learning model.
In order to explore robust and universal ML-based prediction models, we constructed a new dataset containing the experimental Tg data of 1944 small organic molecules and the experimental Td data of 1182 small organic OLED compounds collected from a large body of literature. Based on the new dataset, we utilized an ensemble learning approach, the LightGBM algorithm, rather than the single machine learning methods used in previous works, to integrate multiple weak learners into a combined learner with better prediction performance than any of its components. The prediction performance of our models was verified on two types of out-of-sample datasets, giving satisfactory results and confirming the generality of the models. In addition, experimental validation further confirmed the reliability of our prediction models and their potential in practical applications. More importantly, we built a website including the optimal Tg and Td prediction models coupled with the new dataset, which is freely available at http://oledtppxmpugroup.com. We expect that this website will serve as a useful tool to help experimental investigators quickly and accurately estimate Tg and Td.

Construction of dataset
Unfortunately, there is no existing database that organizes OLED materials and their properties. Currently, the thermal stability data of OLED materials are scattered throughout the literature. Therefore, experimental glass transition temperatures (Tg) for a diverse set of 1944 molecules were collected from a large body of literature using the SciFinder database. These Tg values were measured by DSC. For molecules with multiple recorded entries, an average Tg was used as the output if the variation was less than 40 K. Molecules with a Tg variation larger than 40 K were not included in the dataset. Experimental thermal decomposition temperatures (Td, corresponding to 5% weight loss) for a diverse set of 1182 OLED molecules were also collected from the literature. The Td data for these OLED molecules were measured by TGA. For molecules with multiple Td values, we compared the TGA curves in the literature and took the Td measured from the smoother TGA curve as the final value (see Fig. S1). Table S1 lists 13 OLED compounds with multiple recorded entries. As can be seen, the deviation in Tg for compounds reported in different papers is often within 40 K.
However, the Td values of the same compounds reported in different papers have a large deviation, often greater than 40 K. This is because the purity of the compound has a significant influence on the experimental value of Td (corresponding to 5% weight loss).
To accurately measure Td (corresponding to 5% weight loss), the purity of the compound must be high. It should be noted that there are very few compounds with multiple Td records in our dataset.
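The Tg consolidation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "variation" is read as the difference between the largest and smallest reported value, a spread of exactly 40 K is treated as excluded, and the molecule labels and temperatures are made-up placeholders rather than entries from the dataset.

```python
from collections import defaultdict

MAX_TG_SPREAD_K = 40.0  # threshold stated in the text


def consolidate_tg(records, max_spread=MAX_TG_SPREAD_K):
    """Average repeated Tg entries per molecule.

    records: iterable of (molecule_id, tg_in_kelvin) pairs.
    Molecules whose reported values differ by max_spread or more are
    dropped (assumption: spread = max - min of the reported values).
    """
    by_molecule = defaultdict(list)
    for molecule_id, tg in records:
        by_molecule[molecule_id].append(tg)

    consolidated = {}
    for molecule_id, values in by_molecule.items():
        spread = max(values) - min(values)
        if spread < max_spread:
            consolidated[molecule_id] = sum(values) / len(values)
    return consolidated


# Hypothetical entries: mol_A is reported twice (10 K apart),
# mol_B once, and mol_C twice (50 K apart, so it is excluded).
records = [
    ("mol_A", 400.0),
    ("mol_A", 410.0),
    ("mol_B", 380.0),
    ("mol_C", 350.0),
    ("mol_C", 400.0),
]
tg_dataset = consolidate_tg(records)
print(tg_dataset)
```

The Td rule (choosing the value from the smoother TGA curve) relies on visual comparison of published curves and is not captured by this sketch.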

Descriptors and fingerprints
In this work, molecular descriptors and fingerprints were used to characterize the molecular structure. Both were calculated with PaDEL-Descriptor version 2.21 24. The 1D and 2D descriptors and molecular fingerprints were chosen for their general applicability and their low computational cost. The PaDEL-Descriptor software is open source and free, and the calculation of 1D and 2D descriptors and molecular fingerprints is simple and fast. This facilitates the further promotion and use of our thermal stability prediction models.

Molecular descriptors
1D molecular descriptors were generated based on molecular formulas and 2D molecular descriptors were generated based on the atom connection table. 1D and 2D molecular descriptors belong to the class of molecular property-based descriptors. Each molecular descriptor represents a certain feature of a molecule, such as topology or weight. As each molecular descriptor only depicts a specific property of a molecule, a combination of a large number of molecular descriptors can provide more information.
Using information encoded in canonical SMILES (simplified molecular input line entry system), the PaDEL software offered 1444 1D and 2D descriptors. However, not all the descriptors were used for modeling; for example, descriptors that were not computable for all the compounds were discarded. The remaining 665 parameters were used for model definition (including aromatic atom count, aromatic bond count, atom count, bond count, E-state atom type, extended topochemical atom indices, ring count, topology, topological charge, topological distance matrices, topological polar surface area, XLogP, and weight descriptors).
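One of the filtering criteria mentioned above, discarding descriptors that could not be computed for every compound, might look like the following sketch. The descriptor names and values are hypothetical placeholders, missing values are represented by None as an assumption about the table layout, and the text does not state which further criteria reduced the 1444 descriptors to 665.

```python
def drop_incomplete_descriptors(table):
    """Keep only descriptors that have a value for every compound.

    table: dict mapping descriptor name -> list of values, one per
    compound, where None marks a value that could not be computed
    (an assumed encoding for this illustration).
    """
    return {
        name: values
        for name, values in table.items()
        if all(value is not None for value in values)
    }


# Hypothetical descriptor table for three compounds.
descriptor_table = {
    "descriptor_a": [12.0, 18.0, 24.0],
    "descriptor_b": [3.1, None, 4.7],   # not computable for compound 2
    "descriptor_c": [0.0, 1.0, 0.0],
}
usable = drop_incomplete_descriptors(descriptor_table)
print(sorted(usable))
```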

Molecular fingerprints
Five types of fingerprints (a total of 2741 parameters) were calculated for this research, including CDK fingerprints (1024 bits), CDK extended fingerprints (1024 bits), E-state fingerprints (79 bits), substructure fingerprints (307 bits), and substructure fingerprints count (307 bits). Molecular fingerprints are a subclass of molecular descriptors that can be obtained without quantum-mechanical calculations. They belong to the class of fragment-based descriptors 25, and they were used in this study due to their high potential for the high-throughput screening of materials. These fragment-based descriptors are represented as a Boolean array, indicating the existence of the corresponding fragments in the molecule. The descriptions of the molecular fingerprints used in this study are listed in Table S2. CDK fingerprints, CDK extended fingerprints, and E-state fingerprints represent the molecular backbone well.
Substructure fingerprints and substructure fingerprints count provide differentiation for an array of functional groups.

Machine learning algorithms
In this work, we mainly utilized the ensemble learning strategy to construct a comprehensive model by combining base learners. Compared to a single machine learning model, ensemble learning not only produces a more stable global model, but also reduces uncertainty. Herein, LightGBM, a recent modification of the gradient boosting (GB) algorithm 26 [...] The LightGBM code is available at https://github.com/Microsoft/LightGBM. Other ML algorithms can be found in the Scikit-learn package.
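To make the idea of combining weak base learners concrete, the sketch below implements plain gradient boosting for squared error with one-split decision stumps in pure Python. It is not LightGBM and does not reproduce its histogram-based or leaf-wise tree growth; it only illustrates the general boosting principle on a single hypothetical feature with made-up target values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the current residuals (weak learner)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    errors = [(y - predict_boosting(model, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical one-feature toy data (values are not from the dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [380.0, 385.0, 395.0, 410.0, 430.0, 445.0, 450.0, 455.0]

mse_mean_only = training_mse(fit_boosting(xs, ys, n_rounds=0), xs, ys)
mse_one_round = training_mse(fit_boosting(xs, ys, n_rounds=1), xs, ys)
mse_many_rounds = training_mse(fit_boosting(xs, ys, n_rounds=100), xs, ys)
print(mse_mean_only, mse_one_round, mse_many_rounds)
```

The training error shrinks as more stumps are added, which is the sense in which many weak learners are integrated into one stronger model; training error alone says nothing about generalization, which is why held-out data and cross-validation are still needed.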

Machine learning models for Tg and Td
For both the Tg and Td datasets, 90% of the data was used for model training and the remaining 10% was held out as an independent test set. In order to establish robust ML models to predict the thermal stability of OLED materials, 10-fold cross-validation was used to reduce the randomness of sample division and enhance the stability of the obtained ML models.
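The data partitioning can be sketched with index lists alone. The version below assumes that the 10-fold cross-validation is run inside the 90% training portion and that samples are shuffled with a fixed seed; neither detail is specified in the text, and the function names are illustrative.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction as a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 1944 is the number of Tg samples reported in the text.
train_idx, test_idx = train_test_split_indices(1944)
folds = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

Each training sample appears in exactly one validation fold, so every sample contributes once to the cross-validated error estimate while the held-out test set stays untouched.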
Performance was measured with the squared correlation coefficient (R²), the mean absolute error (MAE), and the root mean squared error (RMSE).
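For reference, the three metrics can be written out as below. The squared correlation coefficient is implemented here as the squared Pearson correlation between measured and predicted values, which is one reading of the term; the coefficient of determination is a common alternative, and the text does not say which was used. The temperatures are made-up values for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Squared Pearson correlation (assumed definition of R²)."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Hypothetical measured and predicted temperatures in kelvin.
measured = [400.0, 420.0, 450.0, 480.0]
predicted = [410.0, 415.0, 440.0, 490.0]
print(mae(measured, predicted), rmse(measured, predicted), r_squared(measured, predicted))
```

MAE and RMSE carry the unit of the target (K), whereas R² is dimensionless; RMSE penalizes large individual errors more strongly than MAE.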
Selecting suitable descriptors is crucial for Tg and Td prediction tasks. We started with the choice of molecular fingerprints. A potential challenge exists due to the multifold molecular features involved in the thermal stability of OLED materials, because a single molecular fingerprint does not cover all of these features. However, combining different molecular fingerprints may solve this problem. Table S3 and Table S4 show the training and testing results of different Tg and Td prediction ML models with different fingerprints as inputs. Joint fingerprints including CDK fingerprints (1024 bits), CDK extended fingerprints (1024 bits), and substructure fingerprints count (307 bits) show the best performance, implying that the representation of molecular structures by the molecular backbone and functional groups is potentially better for Tg and Td prediction than the use of other fingerprints. Therefore, the three molecular fingerprints (CDK, CDK extended, substructure count, 2355 bits) were combined as an input, denoted SC_2CDK.
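The joint SC_2CDK input can be pictured as a simple concatenation of the three fingerprint blocks. The sketch below uses the block lengths given in the text; the dictionary layout, the concatenation order, and the all-zero placeholder vectors are assumptions made for illustration, and the fingerprints themselves would come from PaDEL-Descriptor.

```python
# Block lengths as stated in the text.
FINGERPRINT_LENGTHS = {
    "CDK": 1024,
    "CDK extended": 1024,
    "E-state": 79,
    "Substructure": 307,
    "Substructure count": 307,
}

# Blocks combined into the SC_2CDK input (order is an assumption).
SC_2CDK_BLOCKS = ("CDK", "CDK extended", "Substructure count")


def joint_fingerprint(blocks, order=SC_2CDK_BLOCKS):
    """Concatenate the selected fingerprint blocks into one feature vector."""
    vector = []
    for name in order:
        block = blocks[name]
        if len(block) != FINGERPRINT_LENGTHS[name]:
            raise ValueError(f"unexpected length for {name}: {len(block)}")
        vector.extend(block)
    return vector


# Placeholder fingerprints for one molecule (all zeros, not real data).
molecule_blocks = {name: [0] * n for name, n in FINGERPRINT_LENGTHS.items()}
sc_2cdk = joint_fingerprint(molecule_blocks)
print(len(sc_2cdk), sum(FINGERPRINT_LENGTHS.values()))
```

The lengths are consistent with the totals quoted in the text: 1024 + 1024 + 307 = 2355 for SC_2CDK and 2741 for all five fingerprint types.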
In addition, we also compared the predictive performance of property-based molecular descriptors and the SC_2CDK fingerprints. Table 1 summarizes the Tg and Td prediction results of the LightGBM models. As can be seen, the ML model with 1D and 2D molecular descriptors has better Tg prediction performance than the corresponding ML model with fingerprints. Therefore, the 1D and 2D molecular descriptors provide more information relevant to Tg prediction than the fingerprints do. However, information contained in property-based descriptors (molecular descriptors) and fragment-based descriptors (fingerprints) can complement each other 25. [...] Table 1 [...] Table S5 and Table S6 [...]

Verification of Tg and Td prediction models
The obtained Tg and Td prediction models based on the LightGBM algorithm were further tested in out-of-sample predictions. Two representative applications are shown herein. [...] Table S7 and Table S8 [...] In order to clarify the reasons for this large prediction error, 3CzCNPyz was compared with two other compounds reported in the same paper 49. The TGA curve of 3CzCNPyz is shown in Fig. S2 and the TGA curves of 2Cz2CNPyz and 4CzPyz are shown in Fig. S3. The compounds 2Cz2CNPyz and 4CzPyz have prediction errors of 2.81 K and -1.79 K, much smaller than the prediction error of compound 3CzCNPyz. It is therefore likely that our model is accurate for the Td prediction of 3CzCNPyz. These results further support the reliability and advantage of the ML prediction models.

Independent testing of Tg and Td predictions for hole-transport materials and electron-transport materials
Organic electron-transport materials (ETMs) and hole-transport materials (HTMs) are [...] show that the optimal models can give satisfactory accuracy for the prediction of Tg and Td of small-molecule organic ETMs and HTMs, confirming the reliability of our models.

Experimental validation
In order to verify the application potential of these ML-based models in practice, the [...] Table S11 shows the chemical structures and the predicted Tg and Td values of these designed compounds.
Herein, we focus on the new compound TPA-2, which has the third highest predicted Tg and the highest predicted Td.
Density functional theory (DFT) simulations and time-dependent DFT were performed for TPA-2 before the compound was synthesized. HOMO/LUMO distributions of TPA-2 in the ground state are shown in Fig. S4. The LUMO of TPA-2 is predominantly located on the acceptor, whereas the HOMO is located on the donor. The separated frontier molecular orbitals lead to extremely small theoretical ΔEST values for TPA-2.
The theoretical calculation parameters of TPA-2 were compared with those of TPA-PZCN, which is a high-efficiency red thermally activated delayed fluorescence (TADF) material with an external quantum efficiency close to 30% 86. As shown in Table S11, TPA-2 has a narrower bandgap (Egap) than TPA-PZCN (2.08 eV vs. 2.32 eV). The calculated S1 of TPA-2 is also smaller than that of TPA-PZCN, implying that TPA-2 may show a longer emission wavelength than TPA-PZCN in the same solvent. The ΔEST of TPA-2 (0.22 eV) is smaller than that of TPA-PZCN (0.25 eV). The spin-orbit coupling (SOC) between S1 and T1 was also calculated in the geometry of T1. The <S1|Hso|T1> of TPA-2 (0.27 cm⁻¹) is larger than that of TPA-PZCN (0.13 cm⁻¹), indicating that TPA-2 has a good T1→S1 reverse intersystem crossing (RISC) efficiency.
TPA-2 also maintains a large oscillator strength (0.1886), which benefits the radiative transition from S1 to S0. On the basis of these calculation results, TPA-2 is a good candidate for a red-TADF material. Furthermore, our models predict that TPA-2 has a high thermal stability. Thus, TPA-2 was selected for further experimental validation.
The chemical structure and synthetic route of TPA-2 are presented in Scheme 1. Before testing, the compound was purified by column chromatography and temperature-gradient vacuum sublimation. The structure of TPA-2 was characterized by 1H NMR and 13C NMR (see Fig. S5, Fig. S6 and Fig. S7). The emission maximum of TPA-2 in toluene solution is greater than 600 nm and the ΔEST of TPA-2 is 0.07 eV, indicating that TPA-2 can be used as a red-TADF OLED material (Fig. S8 and Table S12). The thermal properties of TPA-2 were determined by differential scanning calorimetry (DSC) and thermogravimetric analysis (TGA) under a nitrogen atmosphere. A Tg of 411 K (138 °C) and a Td of 697 K (424 °C) were observed (Fig. 7), in good agreement with the values predicted by machine learning. The predicted Tg value is 426 K, corresponding to an error of 15 K, while the predicted Td value is 738 K, corresponding to an error of 41 K. As expected, the TPA-2 compound has good thermal stability. These results show that it is feasible to apply our ML models to predict the thermal stability of unknown OLED materials. Our ML models could serve as a useful tool to quickly screen OLED materials for high thermal stability.
Scheme 1 Chemical structure and synthetic route of TPA-2.
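The arithmetic behind the TPA-2 comparison can be checked directly from the measured and predicted values quoted above. The sign convention (predicted minus measured) and the relative-error calculation are choices made for this illustration and are not taken from the text.

```python
def kelvin_to_celsius(temperature_k):
    """Convert an absolute temperature in kelvin to degrees Celsius."""
    return temperature_k - 273.15


# Measured and predicted values for TPA-2 as quoted in the text (in K).
measured_k = {"Tg": 411.0, "Td": 697.0}
predicted_k = {"Tg": 426.0, "Td": 738.0}

# Signed error, assumed here to be predicted minus measured.
errors_k = {name: predicted_k[name] - measured_k[name] for name in measured_k}

# Relative error with respect to the measured value, in percent.
relative_errors = {
    name: 100.0 * errors_k[name] / measured_k[name] for name in measured_k
}

for name in measured_k:
    print(
        name,
        f"measured {measured_k[name]:.0f} K",
        f"({kelvin_to_celsius(measured_k[name]):.0f} °C),",
        f"error {errors_k[name]:+.0f} K",
        f"({relative_errors[name]:.1f}%)",
    )
```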

Website for Tg and Td Predictions
Currently, hundreds of articles about OLED materials are published every year 8 . There are a lot of useful data in the literature, but there is no existing database that organizes OLED material data. With the aims of archiving the thermal stability data of OLED materials and helping experimental scientists utilize the models reported in this paper for designing new OLED compounds with desired Tg and Td values, an online tool was developed. This website is accessible at http://oledtppxmpugroup.com. Users can make predictions by inputting canonical SMILES, and the outputs include Tg (K) and Td (K).
The Tg and Td data in this article are also available on this website. A screenshot of the website homepage is shown in Fig. 8. More details can be found by visiting the website. We will continue updating the dataset and the optimal models on the website in order to predict the Tg and Td of OLED materials more accurately.

[...] to support the data-driven ML models. With the dataset and the combined descriptors, the optimal LightGBM models offer satisfactory accuracy for the prediction of Td and Tg, with higher accuracy than six other classic ML models (SVM, PLS, LASSO, KRR, kNN, and RF). The models are further validated by two types of out-of-sample prediction (including recently reported host and guest materials as well as organic ETMs and HTMs), exhibiting good robustness and universality. Finally, the experimental validation of a high thermal stability OLED material further confirms the reliability of our models and their potential for practical application. In addition, we constructed a website including all the data and the optimal ML models in order to provide a simple and quick tool for estimating these two important properties for unknown compounds.
We believe this website will assist with the design of future OLED materials.

Author contributions
Xuemei Pu and Zhiyun Lu designed the research. Yihuan Zhao performed the research.
Caixia Fu and Ling Fu contributed to the model construction and data analysis. Caixia Fu performed the experimental synthesis. Xuemei Pu and Yihuan Zhao wrote the manuscript. All authors reviewed the manuscript.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.