Prediction of molecular electronic transitions using random forests

Fluorescent molecules, fluorophores, play essential roles in bioimaging. Attachment of fluorophores to proteins enables observation of the detailed structure and dynamics of biological reactions occurring in the cell. Effective bioimaging requires fluorophores with high quantum yields to detect weak signals. Besides, fluorophores with various emission frequencies are necessary to extract richer information. An essential computational component to discover novel functional molecules is to predict molecular properties. Here, we present statistical machines that predict excitation energies and associated oscillator strengths of a given molecule using a random forest algorithm. Excitation energies and oscillator strengths are directly related to the emission spectrum and the quantum yields of fluorophores, respectively. We discovered specific molecular substructures and fragments that determine the oscillator strengths of molecules from the feature importance analysis of our random forest machine. This discovery is expected to serve as a new design principle for novel fluorophores.


Introduction
Fluorophores play essential roles in diverse fields such as biochemistry, analytical chemistry, medicine, and spectroscopy. [1][2][3][4] For example, they allow us to observe protein-protein interactions and to screen toxic compounds at the molecular level. 5,6 However, only a lim- Seoul-Fluor is developed based on the indolizine core. 7,11 These molecules are designed to have distinct emission frequencies by attaching diverse functional groups to their core structure. 7,9,11 With traditional approaches, finding a novel molecular scaffold takes considerable time and resources to synthesize and verify candidate compounds. 12 Theoretical approaches such as quantum mechanical (QM) calculation or machine learning (ML) may solve this problem more efficiently than conventional approaches. Diverse QM approaches have been developed to predict the photophysical properties of molecules. [13][14][15][16][17][18] Most previous studies focused on predicting the photophysical properties of inorganic solids.
They do not guarantee accurate property prediction of fluorophores used in biomedical researches. [13][14][15][16][17] Sumita and coworkers used density-functional theory (DFT) with the B3LYP hybrid functional and the 3-21G* basis set to evaluate the photophysical properties of molecules. 18 One critical limitation of QM approaches is that they need more considerable computation resources than ML-based approaches. 19 Although there are a few studies to predict the electronic properties of small organic molecules with ML approaches, their prediction accuracy was inferior to that of QM approaches. [19][20][21] Ghosh and coworkers utilized multiple deep-learning (DL) methods to train a machine that predicts the electronic properties of molecules from their structures as inputs. 19 They used multi-layer perceptron (MLP), 21 convolutional neural network (CNN), 22 and deep tensor neural network (DTNN) 17 to predict the density of states from the DFT results of the QM7 23,24 and QM9 25,26 data set. 19 Nakata and coworkers utilized the support vector machine (SVM) and the ridge regression to predict the HOMO-LUMO gaps of molecules. 20 The root mean squared error (RMSE) between QM values and prediction value of their models spans from 0.36 to 0.48 eV. 20 Note that prediction errors of ML studies are similar to that of the time-dependent density functional theory (TD-DFT) benchmarks against experiments. 27,28 To the best of our knowledge, previous ML approaches predicted excitation energies of a molecule, but not oscillator strengths. 19,20 Montavon and coworkers predicted the maximum absorption intensity and corresponding excitation energy from time-dependent Hartree-Fock results using a deep neural network. 21,29 The RMSE of excitation energy of maximum intensity predicted with the existing deeplearning model (1.71 eV) exceeds the range (1.45 eV) of visible light (1.65-3.10 eV). This amount of prediction error appears to be rather large for practical molecular design tasks. 21 Here, we developed statistical models that predict the electronic transitions of a molecule using random forests (RF) algorithm 30 from SMILES input. 31 RF is known to have low overfitting risk and high accuracy. 32 Another advantage of RF over DL is that it facilitates the analysis of relative feature importance. 30,33 In this study, we identified specific fragments that play essential roles in determining oscillator strengths of electronic transitions of molecules.
The performance of our model is encouraging for the discovery of novel fluorophores.
Our model was trained with the TD-DFT results of about half a million molecules.
The highest oscillator strength and associated excitation energy among ten excitation states of molecules are used to train the model. Molecular fingerprints and various molecular property descriptors are used as input features for our model. From the RF results, molecular fragments that play essential roles in determining the molecular electronic properties are identified. Chemical insights into critical factors determining the electronic properties of molecules will be discussed.

Overview of the Workflow
In this study, the target properties for predictions are the excitation energy and oscillator strength of the transition with the maximum oscillator strength among the first ten excited states. The training and test sets contain 450,000 and 50,000 molecules, respectively. A schematic diagram of the workflow is shown in Figure 1.
Two independent RF machines were trained to predict the maximum oscillator strength, and the other RF machine was trained to predict the excitation energy coupled with the maximum oscillator strength. We utilized the RF method implemented in the Scikit-learn 0.21.3 library. 34 The feature vector of a molecule consists of the extended connectivity fingerprints (ECFP), 35 molecular access system (MACCS) keys, 36 and various molecular descriptors provided by RDkit. 37 The obtained feature vectors were used as the input of the RF models.

Molecular Data Set
All data used in this study are from the PubChemQC database. 20 PubChemQC contains the quantum mechanical calculation results of almost four million molecules in PubChem 38 using the TD-DFT methods implemented in GAMESS. 39,40 PubChemQC provides the groundstate electronic molecular structures optimized by DFT at the B3LYP/6-31G* level and the ten low-lying excited state information calculated by TD-DFT with B3LYP/6-31+G*. A half-million molecules were randomly selected from a subset of molecules that consist only of H, B, C, N, O, F, P, S, Cl atoms with no net charge. Among the molecules in the selected set, 450,000 molecules were used as a training set, and the rest as a test set.

Target Properties and Molecular Features
We use the dimensionless quantity oscillator strength (f LU ) that represents the rate of absorption to deal with the excitation intensity. 41 We focus on the transition of the highest oscillator strength because it corresponds to the peak of absorption spectra. Oscillator strength is defined as follows: 42 where m e is the mass of an electron, L and U denote the lower and upper state of transition, and d k and ψ k represent the degeneracy and wave function of state k.
Two fingerprint methods-ECFP 35 and MACCS keys 36 -were used to generate a feature vector from a molecule. In addition to the fingerprints, various molecular descriptors calculated by RDkit were included 37 as well. These molecular features were generated from the SMILES representations of molecules. 31,43 The used molecular descriptors are listed in supplementary information. ECFP is a topological fingerprint for molecular characterization that accounts for the relationships between molecular substructure efficiently. The generation procedure of ECFP systematically records the neighborhood of each non-hydrogen atom into multiple circular layers up to a given diameter as integer identifiers. 35 All these atomcentered substructural identifiers are then mapped into a fixed-length binary representation using a hashing function. 35 The MACCS keys are binary codes that identify the presence of 166 pre-determined molecular sub-structures. Additional molecular descriptors related to the overall topology or the physicochemical properties of a molecule, such as molecular weight or the fraction of sp3 carbon, were included in input features.
Two types of feature vectors were tested in this study. The first vector includes 4096 bits from ECFP4 (of diameter 4), 166 MACCS keys and, 39 molecular descriptors from RDkit, resulting in a 4031-dimensional vector. The second vector includes 16384 bits solely from ECFP6 (of diameter 6) to discover bigger fragments, which determine the electronic properties of molecules. We call the first vector "Vector 1" and the second vector "Vector 2".

Random Forest Regression
To predict the oscillator strength and the excitation energy, random forests regression (RFR), an ensemble learning method was applied as schematically shown in Figure

Feature Importance Analysis
From the trained RF models, the importance of each feature in performing prediction was extracted. Molecular fragments that have high contributions to the oscillator strength of a molecule were identified from the ECFP. Also, critical molecular substructures defined in the MACCS keys and molecular descriptors were identified. To analyze the feature importance, a fingerprint with a longer diameter (ECFP6) and more vector bits (16384) were used to search for huger and more diverse fragments. The four bits with the highest feature importance were selected to discover important fragments that induce high oscillator strength. All fragments corresponding to each selected bit were obtained by the reverse process of the generation. All molecules that include these fragments were then searched from the PubChemQC database.
Finally, eight fragments were selected.

Distribution of the Highest Intensity Transitions from PubChemQC
Analysis of the PubchemQC database shows that it is not trivial to discover favorable flu-

Hyperparameter Optimization
Five hyperparameters of the model were optimized. Two hyperparameters are related to input vector generation: the diameter and the number of bits of ECFP. Three hyperparameters are related to the random forest model: the number of trees, the number of data points in each leaf node, and the ratio of tested features to find the best split. Except for the number of data points per leaf nodes, increasing the size of the model did not improve the performances of models (Table S1 and Table S2). After extensive hyperparameter optimization, ECFP4 with 4096 bits, 1000 trees, 1/3 feature ratio, and 1000 data points per leaf nodes were used to train the model.

Prediction Accuracy of the Highest Intensity Transition
The target properties of our model are maximum oscillator strength and the corresponding excitation energy of a molecule. One of the most important properties to be a favorable fluorophore is high-intensity emission in the visible light range. However, the number of emission spectrum data is still not enough for applying deep-learning or other state-of-theart approaches. 46,47 Additionally, the quantum calculation of emission spectrum is computationally more complicated than that of the absorption spectrum because the geometry optimization of excited states is less accurate than that of the ground state. 47,48 Because the wavelength of absorption and emission are correlated in general, 43 we used the absorption spectrum as a proxy of the emission spectrum.
The correlation coefficient between our prediction and TD-DFT calculation exceeds 0.8 for both oscillator strength and excitation energy( Table 1). The RMSE of oscillator strength and excitation energy predictions with vector1 are 0.088 and 0.448 eV, respectively (Table 1). Figures 3b and 4b show that most molecules are distributed near the diagonal line, representing the perfect prediction. For 96 percent of molecules, the error of oscillator strength between prediction and TD-DFT is less than 0.2 (yellow lines in Figure 3a). In terms of excitation energy prediction, 80 percent of predictions have errors less than 0.5 eV (yellow lines in Figure 4).
To the best of our knowledge, our models are the most accurate models in predicting the oscillator strength and the excitation energy at the highest intensity transition of organic compounds. Two existing approaches predicted excitation spectra based on the HOMO-LUMO gap, 19,20 not the highest intensity transition, which makes a direct comparison with our model hard. Only Montavon and coworkers developed a model that predicts the highest intensity transition by using fully connected hidden layers with a multi-task neural network. 21 They optimized the 3D geometry of molecules using a DFT method starting from SMILES. There are three possible explanations for the high performance of our model. First, the size of our training set is 100 times larger than that of the previous study. It may decrease the risk of overfitting to a small set of molecules. Second, non-Coulomb interactions that were not considered by an atom-pair Coulomb interaction matrices may carry important information. The ECFP, MACCS keys, and molecular descriptors are able to capture local environments including the effects of non-Coulomb interactions. Third, the depth of layers in their model was four, which may be too shallow to learn the electronic properties accurately.
In contrast, the number of leaf nodes and the number of trees in our model are saturated (Table S1 and Table S2).
It is worth noting that the error of our models is comparable to that of TD-DFT quantum calculation against the experiment. 27,28, 49 Fabian and coworkers reported that the mean

Fragments contributing to high oscillator strength
We identified molecular fragments that determine high maximum oscillator strength. To discover bigger and more diverse molecular fragments, we extracted information from the RF results using ECFP6 with 16384 bits. To confirm that the oscillator strength distribu-

Molecular Properties inducing High Oscillator Strength
Our RF models provide which molecular descriptors are significantly related to the electronic property of a molecule. The most critical, highest feature importance, eleven molecular descriptors are illustrated in Figure 7.
In the previous section, it is identified that all fragments inducing high oscillator strength have many π conjugation bonds. It is consistent with the fact that many dye molecules have a large portion of the conjugated π-system. The most critical molecular descriptor determining oscillator strength is the fraction of sp 3 carbon atoms. It is in good agreement with the fact that many known fluorophores have π conjugation systems. The number of stereo-centers and the existence of carbon-carbon double bonds are closely related to the fraction of sp 3 carbon atoms. In other words, it implies that a higher sp 3 carbon fraction leads to a low oscillator strength. The 1 κ shape index, the molecular cyclicity, is ranked in the second place. The index is defined as follows: where A is the number of atoms, and B is the number of bonds. The cyclicity indicates the relative portion of cyclic/branched structures inside a molecule 50,51 The index becomes maximum if a molecule is in a linear shape and minimum if a molecular graph forms a complete graph. The fact that molecular cyclicity is the second most crucial feature indicates that a large portion of aromatic rings is essential for high oscillator strength.
The next critical molecular features are as follows: the maximum partial charge, the minimum absolute partial charge, the existence of -CH 2 N, and the existence of nitrogen connected aromatic rings. These properties appear to contribute to the high oscillator strength through charge separation of a molecule because nitrogen atoms donate lone pair electrons.
It is known that charge separation is closely correlated with light-induced electron transfer. 52

Conclusion
In this work, we developed random forest models that predict the maximum oscillator strength and corresponding excitation energy of a molecule. Our models trained with almost one million DFT results yielded highly accurate predictions. In addition to the high prediction accuracy compared to the existing prediction models, this study presents its importance at one of the few theoretical studies on the prediction of the oscillator strength of a molecule. We further identified molecular fragments and molecular descriptors important for determining the oscillator strength of a molecule. We believe that these findings will serve as a useful guideline for the design of novel fluorophores.

Acknowledgement Supporting Information Available
Hyperparameter Optimization