RegioML: predicting the regioselectivity of electrophilic aromatic substitution reactions using machine learning †

We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with a light gradient boosting machine (LightGBM). The model is trained and tested on 21 201 bromination reactions with 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. ( React. Chem. Eng. , 2020, 5 , 896 – 902) though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physics-based RegioSQM20 method (Nicolai Ree, Andreas H. G¨oller, Jan H. Jensen, J. Cheminf. , 2021, 13 , 10) where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML-model trained on RegioSQM20-generated data. We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution (EAS) reactions. The model relies on ultra fast quantum chemical descriptor calculations combined with an ensemble decision tree variant called light gradient boosting machine (LightGBM). The atomic descriptors are based on CM5 atomic charges obtained from a single conformer embedded with RDKit 29 and single point calculations using the open source semiempirical tight binding method GFN1-xTB. 27 The model is trained and tested on 21 201 bromination EAS reactions corresponding to 101k reaction centers, which are split into a training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% and 90% for the test and out-of-sample sets, respectively, but this is not a good measure of performance due to the preponderance of non-reactive sites. For example, the precision (the percentage of positive predictions that are correct) is 88% for the test set, but only 80% for the out-of-sample set. The  nal RegioML model released to users is trained on the entire data set and we expect similar performance for molecules in-sample and out-of-sample for this large dataset. For example, for a molecule in this large training set, we expect a precision of 99%, while for a molecule that is similar and out-of-sample, we expect a precision of 88% and 80%, respectively. The test-set performance is very similar to of the graph-based WLN method though the comparison is complicated by the that some of the test and out-of-sample are used to train WLN. RegioML out-performs physics-based method 75%.


Introduction
Many useful reactions are underutilised in synthetic organic chemistry because of an inability to predict the regioselectivity of the reaction, 1 and there is thus an increasing interest in developing regioselectivity prediction methods for such reactions. Recent examples include nucleophilic 2,3 and electrophilic aromatic substitution reactions, [4][5][6][7][8][9] Diels-Alder reactions, 10,11 Heck reactions, 12 radical C-H functionalisation of heterocycles, 13 and reactions such as alkylations, Michael additions, and aldol condensations that proceed through proton abstraction. 14 These methods have been based on either quantum chemical (QM) calculations, 2,5,6 machine learning (ML) trained on experimental data, 8,[10][11][12] or a combination of the two where QM has either provided descriptors for the ML model 3,9 or was used to augment the training data. 13,14 However, these approaches have rarely been compared on the same dataset. 9 In this paper we present a ML model (RegioML) that predicts the regioselectivity of electrophilic aromatic substitution (EAS) reactions using QM charges. We compare the performance of RegioML to that of RegioSQM20 (ref. 6)a QM-based predictor for EAS regioselectivityfor the same dataset and discuss how QM-based predictors can be used to augment sparse experimental datasets. We focus in particular on the precision and recall of these methods for in-and out-of-sample datasets.

Dataset preparation
The reaction data are extracted from Reaxys using a set of queries (see the ESI †) resulting in a total of 30 368 bromination reactions. A thorough dataset curation is then performed to obtain a set of unique SMILES (simplied molecular input line entry system) of the reactants and their corresponding site of bromination, which reduces the total number of reactions to 21 896. For example, a reaction is discarded if there is not an exact one-to-one mapping between the heavy atoms of the reactant and the product excluding the reacting bromine(s), or if a reacting bromine forms a bond with something other than a cyclic sp 2 hybridized carbon atom (accounting for 5314 reactions). Furthermore, reactions with unique reaction IDs in Reaxys but identical reactants are merged (accounting for 3158 reactions).

Quantum chemical calculations
Recently, we published the RegioSQM20 method, 6 which predicts the regioselectivities of EAS reactions from semiempirical calculations of proton affinities. The single-tautomer version of this method is applied to the 21 896 reactions to get proton affinities for all of the unique reaction sites. An extension of this method is also applied in which the RegioSQM20 calculations are followed by single point density functional theory (DFT) calculations in methanol (MeOH, dielectric constant ¼ 33.6) using the PBEh-3c composite electronic structure method 15 and the conductor-like polarizable continuum model 16,17 (C-PCM) as implemented in the quantum chemistry program ORCA version 4.2. 18 A few of the calculations resulted in extreme proton affinities corresponding to outliers in the dataset that complicated the development of regression models. However, the calculated proton affinities for both the original and extended RegioSQM20 calculations follow a Gaussian distribution (see the ESI †), which enables the use of Chauvenet's criterion to remove these outliers. In Chauvenet's criterion, the probability of the farthest point is calculated under the assumption of a Gaussian distribution. If this point is below some predened value, then the point is removed, and the procedure is repeated until no more points are removed. In our dataset, molecules are removed if at least one atom in the molecule has a proton affinity corresponding to a probability below 1%.

Atomic descriptors
There are many possible choices for atomic descriptors, ranging from approximate but very efficient properties such as the Coulomb matrix 19 to accurate but computationally expensive properties such as DFT derived charges and Fukui functions. 9 We investigate seven different atomic descriptors of intermediate complexity as input to the ML models (details of the descriptors are given in Table S2 in the ESI †). The atomic descriptors are developed by Finkelmann et al. 20,21 and are chosen because they have been successfully applied to the prediction of the site of metabolism, 21,22 the hydrogen bond donor and acceptor strengths, 23,24 and Ames mutagenicity of primary aromatic amines. 25 Almost all of the descriptors depend on charge model 5 (CM5) atomic charges, 26 which are obtained from single point calculations using GFN1-xTB as implemented in the open source semiempirical soware package xTB version 6.4.0. 27 This particular charge scheme has been shown to be largely conformation-independent and to correctly reect changes in the chemical environment i.e. substituent effects. 20 Hence, only a single conformer is generated for each molecule using ETKDG versions 3 (ref. 28) with useSmallRingTorsions ¼ True as implemented in RDKit version 2020.09.4. 29 This is the key to using quantum chemical derived descriptors as the computational cost is kept at a minimum (details about computational timings are provided in Results and discussion). The atomic descriptors are generated fully automatically from a SMILES representation of a given molecule.
From the screening of the seven atomic descriptors, we nd that a charge shell descriptor with 5 shells and values sorted according to the Cahn-Ingold-Prelog (CIP) rules is particularly good for predicting the regioselectivity of bromination reactions (see Table S3 and Fig. S4 in the ESI †). An illustration of this 485dimensional descriptor can be seen in Fig. 1.

Dataset splitting
We utilize an unsupervised learning procedure similar to the one found in the MOLAN workow by Sivaraman et al., 30 which resembles the ButinaSplitter from DeepChem. The procedure is as follows: SMILES representations of each molecule are converted into extended connectivity (Morgan) ngerprints 31 with radii of 2 and 1024 bits (ECFP4). The ECFP4 ngerprints are then used to construct a Tanimoto similarity matrix, which enables a clustering of the molecules using the Butina clustering algorithm 32 with a radial cutoff of 0.6 as implemented in RDKit. 29 Clusters with at least 7 molecules are included in the training/test set and otherwise in the out-of-sample set to explore how well the trained machine learning models generalize. For some molecules either the descriptor or RegioSQM20 calculations fail, or the molecules are excluded due to Chauvenet's criterion, which le us with 21 201 reactions corresponding to 100 588 unique reaction sites. Thus, applying the above procedure results in a training/test set and an out-ofsample set of 15 246 and 5955 molecules, which correspond to 73 123 and 27 465 unique reaction sites, respectively.
Uniform stratied and random splits are then used to obtain a 80 : 20 ratio between the training and test sets resulting in 12 196 and 3050 molecules corresponding to 58 384 and 14 739 unique reaction sites, respectively. For the uniform stratied split, each of the individual clusters are randomly split and hereaer combined to ensure that both the training and test sets have similar representations of the underlying data distribution. On the other hand, the random split is indeed completely random with respect to all of the molecules obeying the cluster size cutoff.
As the strategy of this work is to learn and predict using atoms instead of molecules, all of the atomic descriptors for atoms in molecules belonging to the training, test, and out-of-sample sets are collected into different input sets and the corresponding proton affinities or classications into different output sets.
An analysis of the training, test, and out-of-sample datasets can be found in the ESI. †

Machine learning models
In order to learn and predict the regioselectivity of EAS reactions, we explore various regression and classication models with respect to both the experimental and calculated data described above. Initially, a screening of 17 regression models and 13 classication models using PyCaret version 2.3.2 (ref. 33) is conducted (details can be found in the ESI †). This allows us to quickly nd promising machine learning methods, which are then thoroughly examined in terms of nding optimal hyperparameters. The hyperparameter optimizations are carried out using a tree-structured Parzen estimator (TPE) as implemented in Optuna version 2.5.0. 34 All training and evaluation are done using either a normal or a stratied 5-fold cross-validation of the randomly shuffled training set in the case of the regression and classication models, respectively, and only the models with the best validation performance are saved for testing. As shown in Table S4 † the best performance for both regression and classication is the ensemble decision tree variant called light gradient boosting machine (LightGBM) version 3.1.1 (ref. 35) using the sorted-shell atomic descriptors with a shell radius of 5. We refer to this method as simply "LightGBM" hereaer.
Furthermore, we examine the imbalance in the dataset using a "Null model", where all sites are predicted to be non-reactive. And we employ a 1-nearest neighbor (1-NN) classier as a baseline model using the brute-force search algorithm and the Jaccard (also known as Tanimoto) metric as implemented in scikit-learn, 36 which corresponds to a perfect memorization of the training set. 37

Results and discussion
The results we present here only involve the random splitting of the training/test set as similar performances are observed for both the stratied and random splits as seen in Table S5 in the ESI. † Unless otherwise noted, all machine learning models are classiers that output a value between 0 and 1 for each atom, where a value greater than 0.5 indicates that an atom should be reactive.

Data-driven machine learning classifiers
In this section, we train and evaluate machine learning classi-ers on experimental data collected from Reaxys consisting of 58 384, 14 739, and 27 465 unique reaction sites in the training, test, and out-of-sample sets, respectively. The experimental data oen contain just single or a few reported reactive sites among all reaction sites in the reactant, i.e. there are signicantly more negatives (N) than positives (P) in the dataset. Consequently the accuracy (the proportion of correct predictions, ACC) can be a misleading metric. For example, a "Null model", where all sites are predicted to be non-reactive achieves a respectable accuracy of 76% (Table 1) for both the test and out-of-sample sets, but this just reects the fact that 76% of the sites in both the datasets are unreactive. The Matthews correlation coefficient 38 (MCC) is a more robust metric to assess the model performance, since it also considers false positives (FP) and false negatives (FN) in addition to true positives (TP) and true negatives (TN).
are also known as precision, recall, specicity, and negative predictive value, respectively. The MCC values for both the test and out-of-sample sets are zero, which clearly shows that the Null model lacks any real predictive power. As a baseline model, we trained a 1-nearest neighbor (1-NN) classier corresponding to a perfect memorization of the training set. 37 The data show that an impressive-looking 86% accuracy can be achieved for the test set by simple memorisation of the training set. In contrast, the MCC value is only 0.62 for the test set and considerably lower (0.49) for the out-ofsample set. These values primarily reect a low precision where only 71% and 59% of the positive predictions are actually correct.
Our best machine learning model (LightGBM) achieves considerably better precisions of 88% and 80% for the test and out-of-sample sets, respectively. Note that while there is only a 3% drop in accuracy on going from the test set to the out-ofsample set, there is an 8% drop in the precision (and a concomitant drop in the MCC). Hereaer, we refer to this method (i.e. LightGBM trained on experimental data) as RegioML.
The test set MCC of RegioML is virtually identical to the Weisfeiler-Lehman neural network (WLN) architecture specically trained to predict the regioselectivity of EAS reactions by Struble et al. 8 While the precision is 4% higher for WLN, the recall (the fraction of positives that are predicted correctly) is 5% lower, leading to a nearly identical overall performance. WLN performs better on the out-of-sample set, with an MCC value that is nearly identical to that of the test set. However, it should be noted that many of the molecules in these two sets are likely included in the set used to train the WLN method, which is likely to inate the MCC values of WLN. For example, we are able to achieve a MCC value of 0.98 on both the test and out-of-sample sets by training the LightGBM model on the entire collection of data using 10-fold cross-validation (the MCC value is for the best performing model).

Comparison to RegioSQM20
RegioSQM20 predicts the regioselectivity of EASs by nding the reaction center with the highest proton affinity. For computational efficiency, the proton affinities are computed using the semiempirical tight binding method GFN1-xTB and a continuum solvent model of MeOH. The centers with proton affinities within 1 kcal mol À1 of the maximum are considered reactive. This method thus has only a handful of hyperparameters (choice of the computational method, solvent, energy cutoff, and conformational search method) and these are chosen based on a dataset of 532 experimental measurements, some of which are included in the current training set.
For the test set, the recall of RegioSQM20 is similar to that of RegioML (81% vs. 83%), but the precision is signicantly worse (75% vs. 88%). For the out-of-sample set, the recall is somewhat better for RegioSQM20 (80% vs. 76%), but the precision is still worse (73% vs. 80%), leading to a slightly smaller MCC value of 0.69 compared to the 0.72 for RegioML. In contrast to RegioML, the overall performance of RegioSQM20 is very similar for the Table 1 Comparing different methods for predicting the reactivity of the 14 739 unique reaction sites in the test set and the 27 465 unique reaction sites in the out-of-sample set. The reported metrics are accuracy (ACC), Matthew's correlation coefficient (MCC), precision (PPV or positive predictive value), recall (TPR or true positive rate), specificity (TNR or true negative rate), and negative predictive value (NPV). All of the ML models are trained on the 58 384 unique reaction sites in the training set with two exceptions: the "RegioML (all reactions)" model is trained on all of the available data including the training, test and out-of-sample sets, and the "LightGBM (532 reactions)" model is trained on data from the RegioSQM20 paper 6 in which 37 and 74 reactions are part of the test and out-of-sample sets, respectively. These reactions are therefore excluded from the reported statistics test and out-of-sample sets, as one would expect from a more physics based method. However, RegioSQM20 does not offer an advantage over RegioML for the out-of-sample dataset, while being computationally much more demanding (see below). The main advantage of the RegioSQM20 approach is that it may offer an accuracy similar to that of RegioML based on a much smaller training set. Indeed a LightGBM model trained on the same 532 reactions used to develop RegioSQM20 results in MCC values around 0.5 for both the test and out-of-sample sets. While the precision is quite good for this model, the recall is less than 50% due to a large proportion of false negatives. Thus, in cases where little experimental data is available, physics-based methods such as RegioSQM20 are likely to outperform ML based methods, even if the latter rely on quantum descriptors such as atomic charges.
The computational expense of the physics-based methods can be mitigated by using them to generate synthetic data for the machine learning model. Indeed, a LightGBM classier trained on RegioSQM20 predictions for the large training set of 58k reaction centers offers the same performance as that of RegioSQM20 for the test set. Of course, the performance is worse for the out-of-sample set just like for RegioML, but the training dataset can now easily be expanded to ensure a better coverage of chemical space. Furthermore, since RegioSQM20 is not used to offer real-time predictions to a user, more accurate and computationally expensive methods can be explored. For example, the precision of RegioSQM20 can be increased by 6% by using PBEh-3c single point calculations to compute the proton affinitiesan increase that is reected in the corresponding ML model. The overall performance of RegioSQM20 PBEh-3c is now identical to that of RegioML for the out-ofsample dataset, with a MCC value of 0.72.
We also explore whether it is better to predict proton affinities using regression and use them to identify reactive centers, rather than the classication approach. Although the LightGBM RegioSQM20 regression model is able to achieve a mean absolute error (MAE) of 2.00 kcal mol À1 on the test set, its accuracy is not good enough to distinguish between reactive and nonreactive sites compared to that of the LightGBM RegioSQM20 model, as evidenced by the low recall values of 71-73%. However, the LightGBM RegioSQM20 regression model can be used to predict low, medium, or high reactivity as we showed in the RegioSQM20 paper. 6 In fact, by combining the classication model and the regression model, one gets both regioselectivity predictions and a qualitative prediction of the reactivity of a molecule with almost no additional cost as the atomic descriptors only have to be calculated once. Examples of the output of RegioML can be seen in Fig. 2.

Timings
In Table 2 we compare the timings of the RegioSQM20 method, the RegioML model, and the WLN architecture by Struble et al. 8 for the 3050 molecules in the test set. We report the median CPU time and the mean CPU time for predicting the regioselectivity of a molecule with a SMILES representation as input. For the RegioML model and the WLN architecture, the timings cover descriptor creation as well as model prediction for all reaction sites in the given reactant.
The results show that the median CPU time requirements of the RegioSQM20 method is 48 s per molecule on four Intel® Xeon® CPU E5-2643 v3 @ 3.40 GHz cores. The RegioML model is almost 100 times faster on just a single Intel® Xeon® CPU X5550 @ 2.67 GHz core with a median CPU time of less than half a second per molecule. The WLN architecture is able to achieve a mean CPU time of just 0.03 s per molecule on the single Intel® Xeon® CPU X5550 @ 2.67 GHz core. The main reason for the slower performance of RegioML is the GFN1-xTB single point calculations needed to compute the atomic charges. Fig. 2 Examples of the output of RegioML. The scores are obtained by a LightGBM classification model, where values above 50% indicate that an atom should be reactive (green circles). However, atoms with scores above 5% are also highlighted (red circles). The predicted low, medium, or high reactivity is based on the highest proton affinity within the molecule obtained by a LightGBM regression model.

Conclusions and outlook
We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution (EAS) reactions. The model relies on ultra fast quantum chemical descriptor calculations combined with an ensemble decision tree variant called light gradient boosting machine (LightGBM). The atomic descriptors are based on CM5 atomic charges obtained from a single conformer embedded with RDKit 29 and single point calculations using the open source semiempirical tight binding method GFN1-xTB. 27 The model is trained and tested on 21 201 bromination EAS reactions corresponding to 101k reaction centers, which are split into a training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% and 90% for the test and out-of-sample sets, respectively, but this is not a good measure of performance due to the preponderance of non-reactive sites. For example, the precision (the percentage of positive predictions that are correct) is 88% for the test set, but only 80% for the out-of-sample set. The nal RegioML model released to users is trained on the entire data set and we expect similar performance for molecules in-sample and out-ofsample for this large dataset. For example, for a molecule in this large training set, we expect a precision of 99%, while for a molecule that is similar and out-of-sample, we expect a precision of 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. 8 though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML out-performs our physicsbased RegioSQM20 method 6 where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. For example, if we retrain the RegioML model on the same 532 reactions we used to develop RegioSQM20, the performance is much worse due to a large increase in the false negative rate leading to a recall (the percentage of positives that are predicted correctly) below 50% compared to the 80% for RegioSQM20. Thus, one use of physics-based approaches such as RegioSQM20 is to generate synthetic data for the ML model for reactions where there is little experimental data. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by a ML model trained on RegioSQM20-generated data.

Author contributions
AG and JHJ developed the idea and lead the project. NR wrote all the code and performed all the calculations. All authors contributed to the data analysis. All authors read and approved the nal manuscript.

Funding
This work was supported by Bayer AG.

Conflicts of interest
The authors declare that there are no competing interests.