A STUDY OF BOOSTING MOLECULAR DESCRIPTORS WITH QUANTUM-DERIVED FEATURES IN PREDICTION OF MAXIMUM EMISSION WAVELENGTHS OF CHROMOPHORES

This study assesses the capability of machine learning to predict the maximum emission wavelengths of organic compounds. The predictions are based on molecular descriptors and fingerprints widely applied in cheminformatics. In an effort to further improve accuracy, the developed machine learning models were enriched with features derived from quantum mechanics. Multi linear, gradient boosting and random forest regressions were applied. The models were trained and tested on a database of experimental optical properties.


Introduction
Machine learning attracts a lot of attention these days. The wide spectrum of tools [1,2] and the rapid growth of database volumes cause machine learning to influence every aspect of life [3]. When searching for machine learning applications in chemistry, one may get the impression that the subject is dominated by drug discovery. As attention to machine learning grows, novel applications and prospects emerge [4], along with literature covering the subject [5,6,7]. Quantitative structure-activity/property relationship (QSAR/QSPR) methods and algorithms are founded on the assumption that molecular structure is correlated with a molecule's properties. In order to translate chemical structures into computer representations, a wide range of molecular descriptors has been developed [8], alongside molecular fingerprints such as the Morgan fingerprint [9] and MACCS keys. Cherkasov et al. [10] and Varnek and Baskin [11] give a broad overview of the core of QSAR/QSPR and cheminformatics, their history, advances and perspectives.
With the growing number of freely accessible databases and open source tools (e.g. Python [12], RDKit [13], Scikit-Learn [14], Matplotlib [15] and Seaborn [16]), it is easy to learn and apply machine learning, or at least to conduct data driven research. Cheminformatics also gains from the openness of researchers who publish their code and data pipelines [17]. Such an attitude simplifies the acquisition of knowledge and makes QSAR/QSPR and machine learning adaptable to other problems.
making predictions of maximum emissions of compounds from test sets. The predictions were compared with the measured values from the test sets and the errors were studied. The process of splitting, training and testing was repeated tenfold in cross validation. The best performing machine learning models were chosen based on mean absolute error, mean squared error, maximum error and the R² parameter. The chosen models are to be validated with compounds from the laboratory.

Datasets
The database of optical properties of organic compounds [18] contains over 20,000 rows, which are combinations of 7,016 chromophores in 365 solvents and 17 solid matrices or in the solid state. Chromophores included in the database consist of at most 150 atoms (excluding H) of C, N, O, S, F, Cl, Br, I, Se, Te, Si, P, B, Sn and Ge. Of these chromophores, 956 have properties reported in the solid state (the Solvent and Chromophore columns are equal). Of the solid state chromophores, 897 have a non-null value of the maximum emission wavelength in nm (dataset 1). Only solid state compounds were taken into account to avoid bias caused by solvent effects on the maximum emission. A subset of compounds containing only C, O, N, F and H atoms was also examined in the study (dataset 2).
The QM9 database [19,20] contains 133,885 small organic compounds of up to 9 atoms (excluding H) of C, O, N and F. These compounds are a subset of the GDB-17 chemical universe database [21], which contains 166 billion organic compounds. The subset includes various quantum properties calculated with density functional theory (DFT), e.g. HOMO and LUMO eigenvalues. The QM9 database was downloaded from MoleculeNet [22] because it is provided there in .csv format.
The range of maximum emission wavelengths in the datasets is shown in figure 2.

Machine learning models
Random Forest Regression (RFR), Multi Linear Regression (MLR) and Gradient Boosted Regression (GBR) models from the Scikit-Learn [14] Python module were used in the study. Initially they were used with default parameters. A chosen subset of parameters was later optimised. The process is described in the "Algorithms optimisation" section.
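As an illustration of this setup, the sketch below scores the three Scikit-Learn regressors with default hyperparameters in tenfold cross validation. It is only a minimal example under stated assumptions: the data are synthetic (generated with make_regression, not the chromophore datasets), random_state is fixed purely for reproducibility, and only mean absolute error is reported.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and emission maxima (assumption).
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

# Default hyperparameters; random_state is set only to make runs repeatable.
models = {
    "MLR": LinearRegression(),
    "RFR": RandomForestRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
}

# Tenfold cross validation with shuffled splits.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mae = {}
for name, model in models.items():
    # Scikit-Learn returns the negated MAE so that larger is better.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()

for name, value in mae.items():
    print(f"{name}: mean MAE over 10 folds = {value:.2f}")
```

The same loop can be repeated with other scoring strings to collect further indicators for each model.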

Feature engineering
All available RDKit molecular descriptors (208), MACCS keys (167) and Morgan fingerprints (1024) were calculated for every chromophore. The numbers of heteroatoms were also calculated (14 for dataset 1 and 3 for dataset 2). The values of the molecular descriptors were then scaled. Features that did not vary within a dataset were deleted. Molecular descriptors, MACCS keys, Morgan fingerprints and numbers of heteroatoms were applied to all tested models and are referred to below as universal features. After all basic data processing procedures had been applied, dataset 1 contained 896 chromophores and 1312 features and dataset 2 contained 523 chromophores and 1127 features.
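The two preprocessing steps mentioned here, removing features that do not vary and scaling, can be sketched in plain Python as follows. The text does not state which scaling was used, so standardisation to zero mean and unit variance is an assumption, as are the function names and the toy values; for simplicity the sketch scales every remaining column and does so after the constant columns are dropped.

```python
from statistics import mean, pstdev

def drop_constant_columns(rows):
    """Drop columns holding a single unique value; return (rows, kept indices)."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in kept] for row in rows], kept

def standardise(rows):
    """Scale each column to zero mean and unit population standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

# Toy feature table: the middle column is constant and carries no information.
features = [
    [1.0, 0.0, 3.5],
    [2.0, 0.0, 1.5],
    [3.0, 0.0, 2.5],
]

filtered, kept = drop_constant_columns(features)
scaled = standardise(filtered)  # safe: no remaining column has zero spread
print(kept)
print(scaled)
```

Dropping constant columns first guarantees that no standard deviation is zero when the remaining columns are scaled.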
Compounds from the chromophore database were examined for substructures from the QM9 database using the RDKit built-in substructure recognition function. To provide the machine learning models with more features and thereby improve their prediction capabilities, various additional quantities were calculated from the quantum properties of the substructures. Karelson et al. [23] covered the use of quantum chemical calculations as descriptors in QSAR/QSPR research, although those calculations were performed on whole molecules, not on their fragments.
Finally, 14 different models were trained and tested with the following features.
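A substructure search of this kind might look like the sketch below. It assumes that the patterns are built from SMILES and matched with RDKit's GetSubstructMatches; the text does not say which RDKit call was actually used, and the helper name and the toy SMILES strings are hypothetical.

```python
from rdkit import Chem

def count_pattern_occurrences(chromophore_smiles, pattern_smiles_list):
    """Count how often each pattern occurs in the chromophore.

    Patterns without any match are left out of the result.
    """
    mol = Chem.MolFromSmiles(chromophore_smiles)
    counts = {}
    for smi in pattern_smiles_list:
        pattern = Chem.MolFromSmiles(smi)
        # Each match is a tuple of atom indices; matches are uniquified by default.
        n = len(mol.GetSubstructMatches(pattern))
        if n > 0:
            counts[smi] = n
    return counts

# Toy example: diethyl ether searched for three small fragments.
counts = count_pattern_occurrences("CCOCC", ["CO", "CN", "CCO"])
print(counts)
```

The resulting occurrence counts are the quantities by which the quantum properties of each pattern are weighted when features are generated.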

Model 1
No QM9 based features were calculated. Only universal features were applied to the ML models. This approach demands the least computational time of all models covered in this paper, as it does not require searching for QM9 database substructures.

Model 2
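For each quantum feature from the QM9 database, the sum over all recognised patterns (substructures) of the feature value multiplied by the number of pattern occurrences was calculated:
$$F_q = \sum_i n_i \, q_i$$
where i is the index of a recognised pattern, n_i is the number of occurrences of pattern i, and q_i is the value of quantum feature q for pattern i.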
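The occurrence-weighted sum used in model 2 can be sketched in a few lines of Python. The function name, the pattern identifiers and the numerical values are illustrative assumptions and are not taken from QM9.

```python
def aggregate_quantum_features(matches, qm9_properties):
    """Sum each quantum property over patterns, weighted by occurrence count.

    matches:        {pattern_id: number of occurrences in the chromophore}
    qm9_properties: {pattern_id: {property_name: value}}
    """
    totals = {}
    for pattern_id, n_occurrences in matches.items():
        for name, value in qm9_properties[pattern_id].items():
            totals[name] = totals.get(name, 0.0) + n_occurrences * value
    return totals

# Made-up property values for two hypothetical patterns.
qm9_properties = {
    "pattern_a": {"homo": -0.25, "lumo": 0.05},
    "pattern_b": {"homo": -0.20, "lumo": 0.10},
}
matches = {"pattern_a": 2, "pattern_b": 1}

features = aggregate_quantum_features(matches, qm9_properties)
print(features)  # homo is about -0.70 and lumo about 0.20 for these inputs
```

Each chromophore then receives one aggregated value per quantum property, which can be appended to its feature vector.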

Model 3
Only the QM9 based features from model 2 were input into the ML algorithms. With this approach it is possible to assess whether non-standard features are competitive with traditional ones.

Models 4-14
The features generated in these models are the result of various mathematical operations, mostly on the HOMO and LUMO eigenvalues, polarizability α, dipole moment µ, zero point vibrational energy (zpve) and electronic spatial extent R². They were developed at the beginning of the research, before molecular descriptors and fingerprints were applied. The equations used to calculate the values are gathered in the supplemental online material (please refer to the availability of data and materials section).
It is worth noting that the RDKit built-in method for detecting substructures may yield invalid results. Figure 3 shows a chromophore from the database and the substructures from the QM9 database that were detected in the molecule. The last of the detected substructures (circled) is not present in the molecule from the database.
Figure 3: Molecule with its detected substructures. The circled one is a mismatch.
In this study the falsely detected substructures were not reviewed and were taken into account when the features were generated.

Results and Discussion
All developed machine learning models were scored using mean absolute error (MAE) and mean squared error (MSE).
To further assess the models' performance, R² and maximum errors were calculated. The scoring values are presented in tables 1 and 2 for dataset 1 and dataset 2 respectively. Since the ensemble algorithms (random forest and gradient boosting) outperformed linear regression, they are covered separately. When quantum derived features are applied, there are minimal changes in prediction accuracy. The decreased performance of model 3 is a result of the exclusion of molecular descriptors from the training and prediction process.
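For reference, the four scoring indicators can be written out explicitly. The sketch below uses the standard definitions in plain Python; the function name and the wavelengths are illustrative assumptions, not values from the study.

```python
def regression_scores(y_true, y_pred):
    """Return MAE, MSE, maximum error and R² for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    max_err = max(abs(e) for e in errors)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "max_error": max_err, "r2": r2}

# Hypothetical measured and predicted emission maxima in nm.
measured = [450.0, 500.0, 550.0, 600.0]
predicted = [460.0, 495.0, 540.0, 630.0]

scores = regression_scores(measured, predicted)
print(scores)  # MAE 13.75, MSE 281.25, maximum error 30.0, R² about 0.91
```

Because MSE squares each error, a single large miss (30 nm here) dominates it far more than it dominates MAE, which is why the two indicators can rank algorithms differently.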
Mean absolute error indicates that RFR performs better than GBR (fig. 4), particularly when trained on dataset 1. There is also an improvement in performance when models are trained on dataset 2.
Models trained and tested on dataset 2 perform about 3 nm better on average (MAE). Most probably this is caused by the greater homogeneity of compound classes in dataset 2. The dataset of compounds composed only of C, O, N and F atoms also lacks maximum emission outliers, which could affect the prediction performance.
In contrast to MAE, the values of mean squared error (fig. 5) imply that gradient boosting performs better than the random forest algorithm.
Further evidence of GBR's more accurate predictions comes from the maximum error (fig. 6). The trend is that GBR performs better than RFR, and the former's worst predictions are about 9 nm more accurate than the latter's. There is also a difference of about 35 nm in maximum error between models trained on different datasets. Figure 7 shows the values of the R² scoring indicator. The difference between models is very slight, but the advantage of models trained on dataset 2 is confirmed.

Multi Linear Models
Since the scoring values of multi linear regression were in most cases unusable when the algorithm was fed with the datasets containing all features, two alternative approaches were employed. The linear regression algorithm was provided either with both the molecular descriptor features and the features generated from pattern recognition in the QM9 database (further referenced as LM1), or only with the features generated from the QM9 database (LM2). In this approach LM1 model 3 is the same as LM2 models 2 and 3. Except for model 3, the LM1 scoring results disqualified this prediction method.
Most scoring indicators calculated in this research imply that, in the case of linear regression, the best prediction accuracy is achieved for model 3. It is also worth noting that, to obtain reasonably applicable results, linear regression models should be provided with features that exclude molecular descriptors.

Algorithms optimisation
Because there was virtually no difference in scoring among the models, the optimisation was conducted only for models 1 and 2 of the GBR and RFR algorithms. The aforementioned scoring values were obtained using the default parameters of the machine learning algorithms. It is therefore relevant to determine whether and how varying those parameters affects the prediction results. Three parameters were chosen for tuning. The process was initially assessed by mean absolute error to determine one of the parameters (max_depth in the case of GBR and n_estimators for RFR). Mean squared error, maximum error and R² were then calculated. Fig. 8 and fig. 9 show a selected part of the optimisation results for the GBR and RFR algorithms respectively. It is worth pointing out that the prediction capability of both algorithms benefits from altering the parameters, but the significant parameters differ. Although performance improved in both cases, GBR performed better than RFR.
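One way such a parameter scan could be carried out is with Scikit-Learn's GridSearchCV, as in the sketch below. This is an assumption rather than the procedure used in the study: the grid values are arbitrary, only max_depth of GBR is varied, and the data are synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and emission maxima (assumption).
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

# Candidate tree depths; the values are illustrative only.
param_grid = {"max_depth": [2, 3, 4, 5]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # negated so that larger is better
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_
print(best_params, round(best_mae, 2))
```

For RFR the same pattern applies with RandomForestRegressor and a grid over n_estimators.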

Conclusions
The presented method of predicting the maximum emission of organic compounds has limited functionality and gives only rough insight into the property. There is room to refine the method to give better predictions. Other applications of machine learning to the prediction of emission wavelengths of organic compounds have been published [24,25].
The reduction of the original database to solid state organic compounds only resulted in the small size of the datasets used in the study. Because of this limitation it is most likely that some classes of chemical compounds are underrepresented. This phenomenon is well known in cheminformatics as class imbalanced data [26,27,28]. Cheng-Wei et al. [25] calculated a curated number of molecular descriptors of solvents, which appears to be the correct way to preserve database diversity. Because the method is sensitive to how well the database corresponds to the compound being assessed, it should be provided with a suitable database. Alternatively, the model could be trained on the fly with a subset of a bigger database chosen by similarity to the compound (e.g. by utilising extended similarity indices [29,30]). With the highly probable emergence of new datasets, machine learning based approaches to QSPR will likely improve their performance.
The introduction of quantum properties of the compounds' substructures did not improve the accuracy of emission prediction with RFR and GBR. The generation of quantum-derived features led to unnecessary computational complexity, thus models 2-14 appear to be redundant. Although outperformed, MLR was able to give sensible results when fed with only quantum-derived descriptors. With the development of new features or with an alternative fragment based approach, these quantum-chemistry descriptors may play some role in prediction.
Other machine learning algorithms should be tested in subsequent research. An appropriate course for improving prediction capabilities would be to feed the predictions of one algorithm into another. The optimisation of the algorithms focused on three parameters. The impact of varying other parameters should also be examined.
Only 2D molecular descriptors were utilised to train the machine learning models. There are fields in which 3D molecular descriptors perform better than 2D ones [31]. In the study all available descriptors were calculated. Further investigations should focus on analysis of the most important ones.
The biggest advantage of the proposed method is its ability to produce results rapidly. When introduced into a web based service, it offers a quick assessment of the emission property of designed compounds. Since the tool accepts SMILES as input, it is easy to use.