Improved Selection of Rare Reactions in Template-Based Retrosynthesis Predictions

Identifying synthetic routes for molecules of interest is a crucial step when discovering new drugs or materials. To find synthetic routes, we can use computer-assisted synthesis planning using expansion policy networks trained on reaction templates extracted from patents and the literature. However, experience has shown that these networks are biased towards frequently reported reactions. This study shows that changing the molecular representation from an extended-connectivity fingerprint to a simple graph representation can increase the accuracy for templates used less than five times by 5.08.5% points. We also illustrate that a simple oversampling of the training set yielded a top-1 accuracy increase in the 17-20% point range for templates used five times or less. Introduction In recent years have computer-assisted synthesis planning has undergone significant changes [1] . Research are moving away from traditional expert systems, built around hand-encoded transformation rules to automated search algorithms guided by artificial intelligence trained on millions of known reactions [2]. The paper from Segler et al. [3] illustrates that reaction templates extracted from large reaction databases coupled with a Monte Carlo tree search (MCTS) can be used as an efficient retrosynthetic planning tool. The MCTS algorithm consists of four phases: selection, expansion, rollout, and update. In the expansion phase, new nodes are added to the search tree. The new nodes are added based on the expansion procedure illustrated in Figure 1. The target molecule is parsed through an expansion policy network, which is a neural network trained to recommend retrosynthetic templates. Typically, the Figure 1 Illustration of the expansion procedure used in the Monte-Carlo tree search. First, the expansion policy network ranks the template library by returning the probability of template (T) given the target molecule (m). The top-50 templates are applied to the target molecule to give potential reactions, which are then deemed feasible or not by the in-scope filter. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 2 of 15 search space of the MCTS is limited by only including the 50 most probable templates (or the number of templates it takes to reach a cumulative probability of 0.995) [3, 4].The reaction databases used to train the expansion policy network are often very imbalanced, i.e. some templates are used frequently, while many are rarely used. In the US Patent Office reaction database (USPTO) curated by Thakkar et al., 10% of all recorded reactions use transformations that is only reported five times or less [5, 4]. However, the infrequently used transformations account for roughly 60% of all unique templates found in the database. In the other end, 50% of the reactions use templates that are reported more than 100 times but the frequent transformations only account for 2% of the unique templates. The imbalanced nature of the reactions databases results in expansion policy networks biased towards frequently used templates. For example, the expansion policy network published with AIZynthFinder [6] has a top-1 accuracy of 65.7% for reactions that use templates that occur more than 100 times in the USPTO database, but only a top-1 accuracy of 34.8% for reactions that use templates recorded five times or less. In essence, the expansion policy networks will perform worse for less frequently used templates. There have been proposed multiple ideas to address the low-data issue. Some solutions rely on template applicability models and knowledge transfer to increase accuracy in the low-data domain [7, 8]. Others use structural information about the templates to allow the models generalizing information across template classes [9]. Recently there have been growing evidence that properties computed using quantum mechanics (QM) methods can offer additional information about molecules that are difficult to learn from the molecular graph representation itself without many examples [10, 11]. By relying on more abstract similarities between products, QM methods are powerful tools and can be used to extract reactivity trends for organic reactions [12]. For example, QM methods can be used to identify reactive atomic sites via local reactivity descriptors [13]. Local reactivity descriptors, such as the Fukui functions, indicate how the electron density of a given molecule responds to an attack of another and have been applied to identify which sites are most suitable for electrophilic or nucleophilic attack [14, 15]. A set of such local chemical descriptors can therefore carry essential information about the chemical reactivity. In this paper, we test the hypothesis that we can reduce the skewed performance of expansion policy networks by introducing semi-empirical QM properties. The QM data is embedded in the model through graph convolution. While we find that this approach improved performance we show that the improvement comes from the graph convolutional molecular representation instead of the fingerprint representation used originally. Secondly, we show that the skewed performance can be decreased further by using oversampling of rare templates. Methods Template library. The database of reaction templates was obtained from the publicly available US Patent Office dataset [5] as prepared by Thakkar et al. [4]. The library consists of 911,869 reactions distributed across 46,695 unique templates. Training, validation, and test sets were constructed by randomly splitting the 911,695 reactions in USPTO database in a 90/5/5 split. The validation and training set each contains 45,594 reactions, while the training set contains 820,682 reactions. The 46,695 unique templates in the USPTO database were represented as binarized labels. The reactions are encoded as pairs of reaction SMARTS and product. The models are trained to assign the highest probability to the reaction SMARTS given the corresponding product. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 3 of 15 Template recommender ECFP neural network. The ECFP-based recommender model is published with AIZynthFinder [4, 19, 6]. The network is identical to the architecture of the dense layer following the graph convolutional layers used in the graph recommender neural networks. The input is a 2048-bit ECFP4 fingerprint computed using the Morgan algorithm in RDKit [16, 20]. The input is passed to a fully connected layer with 512 nodes with an ELU [21] activation and a dropout rate of 0.4. Finally, a SoftMax activation yield probabilities of each template in the library. The ECFP recommender model is trained using the exact same train, validation, and test set we used to train the graph convolutional neural networks. Template recommender graph neural network. We have trained a series of graph recommender neural networks that differ in how the information is embedded into the molecular graph and how the convolutional operation is performed. We use a simple convolutional operator: the graph convolution (GCN) introduced by Kipf et al. [22]. The network architecture is illustrated in Figure 2. The network can take two inputs: a graph that contains simple atom descriptors extracted with RDKit, and a vector that contains atomic properties calculated with SQM. A more detailed description of the input used for each model can be found in SI. The RDKit data can optionally be parsed through a fully connected layer (red layer in Figure 2) to reduce the RDKit feature input dimensionality before it is concatenated with the QM data. After concatenating the RDKit and QM data, the size of the atom embedding has to match the dimensions of the Figure 2 Illustrates the graph neural network architecture. The network takes molecules represented as graphs as input. The RDKit and QM descriptors are concatenated (⊕) before being parsed to the graph neural network. The dense layer (in the red box), before concatenation, is optional and is used to compress the RDKit input. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 4 of 15 convolutional layer. Thus, the atom embedding is expanded in another fully connected layer with a ReLU activation to match the convolutional operator. Next, the graph is parsed through three layers of graph convolutions and turned into a single vector representation by a global max-pooling layer. Hereafter, the vector is passed to a dense neural network with an architecture identical to the one used by Thakkar et al [4]. Training of the networks was allowed to continue for a maximum of 200 epochs. Training is stopped early if the validation loss do not improve for ten epochs. During training, the validation loss is monitored, and if the loss hit a plateau for five epochs (defined as no improvement greater than 10) the learning rate is halved. The implementation of the graph neural networks is all done in PyTorch Geometric [23]. The training was carried out using PyTorch Lightning using an Adam optimizer [24] with an initial learning rate of 0.001. The loss was computed with the categorical crossentropy loss. Creating quantum chemical data. We generated a database containing atomic/bond properties calculated by semi-empirical quantum mechanics for all 911,869 reactions. The properties was computed using a simple automated computing workflow. The workflow starts from a SMILES string representing the product and then we create a random 3D conformer using the ETKDGv3 algorithm as implemented in RDKit [16, 17]. The conformer was then subject to a single-point and Fukui GFN2-xTB [18] calculation from which we extracted the local atomic QM properties. We extracted the partial charge (q), Fukui index (f+, f0, f-), polarizability (α), and for each atom sum the Wiberg bond orders (sBO), which previously have been employed to augment machine learning representations [10, 11]. Results and Discussion Performance of the recommender neural networks. We trained several graph template recommender models, differentiated by the input information available to each model. The ECFP policy expansion network according to Genheden et al. converged within 15-20 epochs, which is quicker than the graph neural networks [19]. However, all graph models converged within 100-150 epochs as seen in SI Figure S 1. To establish a baseline for the graph neural network type of model, we train a graph neural network that essentially follows an implementation similar to DeepChem's GCN networks [13]. The node features are extracted directly from the RDKit molecular object graph of the product. They include information about the atom type, number of directly bonded neighbours, formal charge, Table 1 Test set accuracy for the different graph neural networks. Model Top-1 (%) Top-5 (%) Top-10 (%) Top-50 (%) ECFP* 58.7 83.5 88.7 94.8 GCN (RDKit) 58.6 84.5 89.9 95.7 GCN (RDKit + QM) 58.7 84.5 89.8 95.8 GCN (12 RDKit) 58.6 84.4 89.7 95.6 GCN (6 RDKit, 6 QM) 57.5 83.7 89.4 95.5 GCN (Atom type) 48.8 74.8 82.1 91.9 GCN (Atom type + QM) 56.5 82.7 88.4 95.0 * The ECFP model is the public recommender model trained on the USPTO database, which is downloaded as part of AIZynthFinder. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 5 of 15 hybridization, automaticity, ring information, and atomic mass. Which is very similar to the atomic information used to construct the ECFP representation [20]. We call this model GCN (RDKit) as it only contains atom features extracted directly from RDKit. As illustrated in Table 1, the mean accuracy of the baseline GCN (RDKit) model is quite similar to the public ECFP recommender model. If anything, the top-5, top-10, and top-50 accuracy of the GCN (RDKit) model slightly outperform the ECFP recommender model with approximately 1% point. However, as stated in the introduction, ML models generally perform worse for less frequent data. Due to the very imbalanced nature of the data, the mean accuracy across the entire test hides details in the performance. As Illustrated in Figure 3, which shows the mean accuracy for the cumulative template occurrence, there is a distinct drop in accuracy for less frequent reactions. The top-1 accuracy of the templates that occur less than or equal five times is roughly half as accurate as the templates that occur more than 100 times: 34.8% compared to 65.7% for the ECFP model, and 33.5% compared to 65.0% for the and GCN (RDKit). Figure 3 illustrates that the insignificant differences in the mean accuracy hid some noteworthy differences between the two models. It is apparent that the GCN (RDKit) model generally performs better for the rare reactions, and this trend is most evident for the top-50 accuracy where we find a difference of 8.5% between the two models for templates that occur less than or equal to five times. This slight rise in accuracy for the GCN (RDKit) model can likely be attributed to the convolutional layers, which in essence, build up a custom fingerprint that is trained towards recommending which template to use. Next, we investigate what happens if QM properties are included in the node feature vector. Therefore, we extend the GCN (RDKit) atom feature input with six additional GFN2-xTB QM features: the partial charge (q), the Fukui index (f+, f0, f-), the sum of Wiberg covalent bond orders (sBO), and the polarizability (α). We could have calculated many different molecular and atomic properties, which Figure 3 The top-1, top-5, top-10, and top-50 cumulative average accuracy for the ECFP and graph models. The average accuracy is calculated for the set of reactions (only test set reactions), for which the template (t) occur in the USPTO database less than or equal X times (from 3-1000 occurrences). Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 6 of 15 could have been used as descriptors. However, many of the properties may carry redundant information, and testing all combinations of available descriptors at different levels of theory is beyond the scope of this study. When including QM properties, we remove the corresponding RDKit properties, i.e. the formal charge and number of directly bonded neighbours. The extended model, we give the name GCN (RDKit + QM). The mean accuracies for the GCN (RDKit) and GCN (RDKit + QM) model (listed in Table 1) are practically identical with only insignificant changes of maximally 0.1% point. The same trend is observed in the cumulative accuracy in Figure 3 where it is hard to distinguish the two models. To ensure that result is not just a consequence of the uneven distribution of RDKit and QM atom features. The input node embedding in the GCN (RDKit + QM) contains 32 values in total, and only six of those are QM properties. We train two extra models where the RDKit part of the GCN (RDKit) and GCN (RDKit + QM) models are compressed into a six and 12-dimensional vector, respectively. By compressing the RDKit part of the GCN (RDKit + QM) before concatenating with the six QM properties, we transform the node embedding into a 12-dimensional feature vector with equal weight on the RDKit and QM features. The compression is performed by passing the RDKit input feature vector through a dense linear layer with a ReLU activation function (red layer in Figure 2) before concatenating with the QM descriptors. Again, we observe no significant differences between the two models when comparing the mean accuracies in Table 1 or the cumulative accuracy in Figure 3. If anything, the GCN (12 RDKit) model performs slightly better than the GCM (6 RDKit, 6 QM). These findings refute our hypothesis that the lack of differences is a result of uneven distribution of RDKit and QM features. To better understand what the QM properties do to the model, we train two additional models. The first model we name GCN (Atom type) and is the most basic model we trained. The GCN (Atom type) model only contains node features that describe the kind of atom (same atom encoding as the GCN (RDKit) model). The second model is called GCN (Atom type + QM) extends the GCN (Atom type) model by including the six QM atom properties. Unsurprisingly, the GCN (Atom type) model performs the worst since it contains much less information about the atoms. The top-1 mean accuracy (Table 1) drops from 58.6% to 48.8% compared to the GCN (RDKit) model. However, when the QM properties are included, the top-1 accuracy increases to 56.5%, which is only a decrease of 2.1% points compared to the GCN (RDKit) model and only a drop of 0.7% points for the top-50 accuracy. This suggests that there is a significant correlation between the RDKit features and the QM property features we chose. Such correlations can explain why we see no differences in accuracy between the GCN (RDKit) and GCN (RDKit + QM) model since they, in essence, carry similar information about each atom. It also explains why the GCN (Atom type + QM) and the GCN (RDKit) model almost perform the same. Applicability of the predicted templates. The accuracy of the expansion network only reflects the network's ability to match one product to one template. In practice, when the network is applied to retrosynthetic route planning, the top-50 templates (or a cumulative probability of 0.995) are typically applied to the target product molecule [14, 1]. However, even a highly accurate model is not guaranteed to recommend applicable templates. This problem gives rise to high failure rates for the proposed templates due to the template's inability to match a substructure in the target for which it is was predicted. Table 2 shows the average number of templates it takes to reach cumulative probability above 0.995, and the number of those that actually can be applied to the target molecule. Here we notice differences between the ECFP recommender model and the GCN (RDKit) model. The most notable is the average number of templates that needed to reach a probability of 0.995 is 42.3 Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 7 of 15 for the ECFP model compared to 31.9 for the GCN (RDKit) model. Which could indicate that the GCN (RDKit) model is slightly more certain about the top-1 prediction compared to the ECFP model, and therefore does not need many extra templates to reach a cumulative probability of 0.995. This is confirmed by comparing the average top-1 probability for the GCN (RDKit) and ECFP models which is 0.69 and 0.63, respectively. Because we include less templates we also see a slight decrease in the number of applicable templates from 16.3 to 12.7. However, the success rate of applicable templates in the GCN (RDKit) and ECFP models is approximately 40% in both cases. By comparing the GCN (RDKit) and GCN (RDKit + QM) we see that the average number of templates needed to reach a cumulative probability of 0.995 and the number of applicable templates is practically unaffected by including our chosen QM properties. The maximal number of applicable templates we find is 16.3 for the ECFP model. To ensure that it is not just the average number of applicable templates for the target molecule (of the 46,695 unique templates), we apply all templates to all target molecules. The distribution of applicable templates in the test set for each target molecules are seen in the SI Figure S 2, but on average can 281 templates be applied to the target. The network thus still provides a clear enrichment of applicable templates in the top prioritized templates. The fact that we, on average, can apply 281 of the 46,695 templates raises an essential question about the training setup. For retrosynthetic prediction, we are not in general interested in training a network that only recreates the one-to-one mapping found in the training set (target molecule to a template). Instead, we want the network to propose several applicable templates, preferably a mix of rare and common templates. However, as training is set up for most data-driven retrosynthetic tools available, the network is, during training, attempting to recreate the one-to-one mapping. This is due to the choice of the loss function, i.e. the cross-entropy loss, which only rewards the network if it succeed in an exact one-to-one mapping [1, 14]. The discrepancy between the training objective and how we are applying the model means, in the extreme case, that we are relying on coincidences that there is more than one applicable template among the top-50 templates. The training setup will force the network to get as close as possible to predict precisely one template with more certainty. During training the network will attempt to make the top-1 probability as close as possible to 1. Thus, we will likely observe that the more accurate (or specific) the model becomes, the fewer templates the model recommend (given the same probability cut-off). Because the model is more confident in its prediction (top-1 probability is closer to 1), we needed fewer templates to reach a cumulative probability of 0.995. Table 2 Number of templates with a cumulative probability above 0.995, and how many of these can actually be applied. Model Avg. # templates Avg. applicable templates ECFP* 42.3 16.3 GCN (RDKit) 31.0 12.7 GCN (RDKit + QM) 30.1 12.1 GCN (RDKit), oversample 10 27.7 11.4 GCN (RDKit), oversample 50 27.1 10.3 * The ECFP model is the public recommender model trained on the USPTO database, which is downloaded as part of AIZynthFinder. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 8 of 15 This trend is observed when we compare the average templates of the GCN (RDKit) and ECFP recommender model in Table 2. The GCN type of networks thus generally offers an advantage over the ECFP network, with a better top-50 accuracy for less frequent reactions. But the inclusion of QM data does not offer any improvements over the GCN (RDKit) model that only uses RDKit properties. Random oversampling/undersampling. As an alternative to adding information on the atomic level, we also tested oversampling of seldomly used templates. The reason for oversampling/undersampling in machine learning is that classifiers (as the recommender model) are typically more sensitive towards the majority class. This can be illustrated by analyzing the mean template occurrence, which describes the average reported instances of the templates used/predicted of the reactions in question. For example, two reactions rely on templates reported 50 and 100 times, resulting in a mean template occurrence of 75. In the USPTO test set, the mean template occurrence is 1112, while the top-1 mean template occurrence for the GCN (RDKit) model is 1742. This indicates that the trained model generally predicts more frequently used templates than the underlying dataset dictates. To overcome the bias, more weight can be assigned to the minority classes in the training set by sampling them with replacement or reducing the importance of the majority classes in training set by undersampling, which means that you only choose some samples from the majority class. The objective of oversampling/undersampling is to get a more even accuracy across all types of reactions, which would be a more horizontal line for the cumulative template occurrence in Figure 5. We train the GCN (RDKit) model on three different training sets to test how random under/oversampling influences the recommender models performance. The three different training sets all have varying levels of oversampling and undersampling. The first training set is created by oversampling. We sample ten reactions (with replacement) for each template class that occur ten or fewer times in the training set (35,355 of the 46,673 templates). The training set contains 1,022,940 reactions. The second training set is sampled identically to the first one, but instead, we use a cut-off of 50 (44,636 of the 46,673) occurrences. With a cut-off of 50, we get a training set with 2,717,960 reactions. The third training set is curated by both underand oversampling the reactions, so the number of reactions in each template class is equal. We do this by sampling 30 reactions, with replacement, from each template class, resulting in a training set size of 1,400,190 reactions. Figure 5 clearly illustrates that by oversampling the less frequent reactions, we see a significant increase in the accuracy for the less used templates. The top-1 accuracy for reactions that occur less than or equal to five times increases from 33.8% to 50-54%, depending on how the test set sampling is performed. However, the way that the model behaves for frequent reactions is quite different. The model where each template is represented by exactly 30 reactions (grey line) shows a dramatic decrease in accuracy for frequently used templates. Also, the average template occurrence for the model is only 78, which suggests that the model is heavily biased towards less frequent reactions. The strong bias indicates that we have thrown too much information away due to a very aggressive undersampling of common templates. The two other models, that oversample the rare reactions (cyan and yellow lines), both clearly exhibit a more horizontal trend. The horizontal trend indicates that the accuracy are more balanced across all templates than the model trained on the unprocessed training set. Another way to illustrate this is by comparing the average top-1 template occurrence which is 1492 and 1389 for the slightly oversampled (cut-off 10, cyan line) and the “heavily” oversampled (cutoff of 50, yellow line), respectively. These average template occurrences are much better in line with the true values of the test set, albeit there is still room for improvement. The slightly over oversampled Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 9 of 15 model performs worse on less frequently used templates then the “heavily” oversampled model. However, around the a template of 100 occurrence the “heavily” oversampled model falls below the model trained on the standard training set. The drop below the standard model means that even though the accuracy for rare reactions is better the overall accuracy of the model is a bit worse. The applicability of the templates recommended by the slightly and “heavily” oversampled models are pretty similar. As seen in Table 2, both models, on average, needs the top-28 templates to reach a Figure 5 Cumulative accuracy for the GCN (RDKit) model trained on four different training sets. GCN (RDKit) – Up10 and Up50 are the “slightly” and “heavily” oversampled training set, while the GCN (RDKit) – Up equal model is trained on a training set where the samples were equally distributed. Figure 4 The top-50 cumulative SoftMax probability. Up10 and Up50 refer to the “slightly” and “heavily” oversampled training sets. Thesis Chapter ChemRxiv.org preprint December 9, 2021 Page 10 of 15 cumulative probability of 0.995 (or maximally top-50). But among the recommended templates, only 10 and 11 of them are applicable. Figure 4 show the average cumulative probability of the models. It illustrates that the increase in cumulative probability is more significant when fewer templates are included for the oversampled graph methods. It confirms the tendency that increased model confidence will result in fewer templates needed to reach the cumulative probability of 0.995 (or maximally 50 templates). Consequently, it will recommend fewer applicable templates. Compared to the ECFP recommender model, the two oversampled models will find 5 and 6 fewer applicable templates from the USPTO test set. Conclusions We investigated how a selection of semi-empirical QM properties would influence the accuracy of the expansion policy networks used for retrosynthetic planning. We used graph representations to train graph neural networks to incorporate the atomic QM properties into a meaningful molecular descriptor. We trained the neural graph networks with and without QM properties and found that the selected semi-empirical QM properties did not affect the accuracy or template applicability. It is important to highlight that we only tested a few QM properties and did not perform an exhaustive test of QM properties. Other properties could potentially yield a different conclusion as illustrated by Stuyver et al. [11]. By changing from the ECFP representation to a graph representation, the top-5, top-10, and top-50 accuracies increased by 5.5-8.5% points. When switching to a graph representation, we observed that the number of templates needed to reach a cumulative probability of 0.995 decreased from 42.2 to 31.3. This decrease resulted in a slight reduction in the number of applicable templates (16.3 to 12.7), which made us question the contradictory way we train and apply the expansion policy network. When the expansion policy network is trained, we target a one-to-one mapping between the target molecule and template documented in the training set. However, when the network is applied for retrosynthesis prediction, we expect the network to return multiple applicable templates for the target molecule. We also tested if oversampling/undersampling can increase the accuracy of rare templates without decreasing the accuracy of the frequently used ones. These relatively simple techniques result in a substantial top-1 accuracy increase of approximately 20% for templates represented five times or less in the USPTO reaction dataset. We tested three different ways of oversampling/undersampling and found that not all methods perform equally well. If the templates are sampled evenly, you see a severe decrease in accuracy for templates used more than 100 times. But if you only oversample templates observed 50 times or less, we do not observe the same severe decrease in accuracy for frequently used templates.


Introduction
In recent years have computer-assisted synthesis planning has undergone significant changes [1] . Research are moving away from traditional expert systems, built around hand-encoded transformation rules to automated search algorithms guided by artificial intelligence trained on millions of known reactions [2]. The paper from Segler et al. [3] illustrates that reaction templates extracted from large reaction databases coupled with a Monte Carlo tree search (MCTS) can be used as an efficient retrosynthetic planning tool.
The MCTS algorithm consists of four phases: selection, expansion, rollout, and update. In the expansion phase, new nodes are added to the search tree. The new nodes are added based on the expansion procedure illustrated in Figure 1. The target molecule is parsed through an expansion policy network, which is a neural network trained to recommend retrosynthetic templates. Typically, the search space of the MCTS is limited by only including the 50 most probable templates (or the number of templates it takes to reach a cumulative probability of 0.995) [3,4].The reaction databases used to train the expansion policy network are often very imbalanced, i.e. some templates are used frequently, while many are rarely used. In the US Patent Office reaction database (USPTO) curated by Thakkar et al., 10% of all recorded reactions use transformations that is only reported five times or less [5,4]. However, the infrequently used transformations account for roughly 60% of all unique templates found in the database. In the other end, 50% of the reactions use templates that are reported more than 100 times but the frequent transformations only account for 2% of the unique templates. The imbalanced nature of the reactions databases results in expansion policy networks biased towards frequently used templates. For example, the expansion policy network published with AIZynthFinder [6] has a top-1 accuracy of 65.7% for reactions that use templates that occur more than 100 times in the USPTO database, but only a top-1 accuracy of 34.8% for reactions that use templates recorded five times or less. In essence, the expansion policy networks will perform worse for less frequently used templates. There have been proposed multiple ideas to address the low-data issue. Some solutions rely on template applicability models and knowledge transfer to increase accuracy in the low-data domain [7,8]. Others use structural information about the templates to allow the models generalizing information across template classes [9].
Recently there have been growing evidence that properties computed using quantum mechanics (QM) methods can offer additional information about molecules that are difficult to learn from the molecular graph representation itself without many examples [10,11]. By relying on more abstract similarities between products, QM methods are powerful tools and can be used to extract reactivity trends for organic reactions [12]. For example, QM methods can be used to identify reactive atomic sites via local reactivity descriptors [13]. Local reactivity descriptors, such as the Fukui functions, indicate how the electron density of a given molecule responds to an attack of another and have been applied to identify which sites are most suitable for electrophilic or nucleophilic attack [14,15]. A set of such local chemical descriptors can therefore carry essential information about the chemical reactivity. In this paper, we test the hypothesis that we can reduce the skewed performance of expansion policy networks by introducing semi-empirical QM properties. The QM data is embedded in the model through graph convolution. While we find that this approach improved performance we show that the improvement comes from the graph convolutional molecular representation instead of the fingerprint representation used originally. Secondly, we show that the skewed performance can be decreased further by using oversampling of rare templates.

Methods
Template library. The database of reaction templates was obtained from the publicly available US Patent Office dataset [5] as prepared by Thakkar et al. [4]. The library consists of 911,869 reactions distributed across 46,695 unique templates. Training, validation, and test sets were constructed by randomly splitting the 911,695 reactions in USPTO database in a 90/5/5 split. The validation and training set each contains 45,594 reactions, while the training set contains 820,682 reactions. The 46,695 unique templates in the USPTO database were represented as binarized labels. The reactions are encoded as pairs of reaction SMARTS and product. The models are trained to assign the highest probability to the reaction SMARTS given the corresponding product.
Template recommender ECFP neural network. The ECFP-based recommender model is published with AIZynthFinder [4,19,6]. The network is identical to the architecture of the dense layer following the graph convolutional layers used in the graph recommender neural networks. The input is a 2048-bit ECFP4 fingerprint computed using the Morgan algorithm in RDKit [16,20]. The input is passed to a fully connected layer with 512 nodes with an ELU [21] activation and a dropout rate of 0.4. Finally, a SoftMax activation yield probabilities of each template in the library. The ECFP recommender model is trained using the exact same train, validation, and test set we used to train the graph convolutional neural networks.
Template recommender graph neural network. We have trained a series of graph recommender neural networks that differ in how the information is embedded into the molecular graph and how the convolutional operation is performed. We use a simple convolutional operator: the graph convolution (GCN) introduced by Kipf et al. [22].
The network architecture is illustrated in Figure 2. The network can take two inputs: a graph that contains simple atom descriptors extracted with RDKit, and a vector that contains atomic properties calculated with SQM. A more detailed description of the input used for each model can be found in SI. The RDKit data can optionally be parsed through a fully connected layer (red layer in Figure 2) to reduce the RDKit feature input dimensionality before it is concatenated with the QM data. After concatenating the RDKit and QM data, the size of the atom embedding has to match the dimensions of the Figure 2 Illustrates the graph neural network architecture. The network takes molecules represented as graphs as input. The RDKit and QM descriptors are concatenated (⊕) before being parsed to the graph neural network. The dense layer (in the red box), before concatenation, is optional and is used to compress the RDKit input. convolutional layer. Thus, the atom embedding is expanded in another fully connected layer with a ReLU activation to match the convolutional operator. Next, the graph is parsed through three layers of graph convolutions and turned into a single vector representation by a global max-pooling layer. Hereafter, the vector is passed to a dense neural network with an architecture identical to the one used by Thakkar et al [4]. Training of the networks was allowed to continue for a maximum of 200 epochs. Training is stopped early if the validation loss do not improve for ten epochs. During training, the validation loss is monitored, and if the loss hit a plateau for five epochs (defined as no improvement greater than 10 -6 ) the learning rate is halved. The implementation of the graph neural networks is all done in PyTorch Geometric [23]. The training was carried out using PyTorch Lightning using an Adam optimizer [24] with an initial learning rate of 0.001. The loss was computed with the categorical crossentropy loss.
Creating quantum chemical data. We generated a database containing atomic/bond properties calculated by semi-empirical quantum mechanics for all 911,869 reactions. The properties was computed using a simple automated computing workflow. The workflow starts from a SMILES string representing the product and then we create a random 3D conformer using the ETKDGv3 algorithm as implemented in RDKit [16,17]. The conformer was then subject to a single-point and Fukui GFN2-xTB [18] calculation from which we extracted the local atomic QM properties. We extracted the partial charge (q), Fukui index (f+, f0, f-), polarizability (α), and for each atom sum the Wiberg bond orders (sBO), which previously have been employed to augment machine learning representations [10,11].

Results and Discussion
Performance of the recommender neural networks. We trained several graph template recommender models, differentiated by the input information available to each model. The ECFP policy expansion network according to Genheden et al. converged within 15-20 epochs, which is quicker than the graph neural networks [19]. However, all graph models converged within 100-150 epochs as seen in SI Figure S 1.
To establish a baseline for the graph neural network type of model, we train a graph neural network that essentially follows an implementation similar to DeepChem's GCN networks [13]. The node features are extracted directly from the RDKit molecular object graph of the product. They include information about the atom type, number of directly bonded neighbours, formal charge, hybridization, automaticity, ring information, and atomic mass. Which is very similar to the atomic information used to construct the ECFP representation [20]. We call this model GCN (RDKit) as it only contains atom features extracted directly from RDKit. As illustrated in Table 1, the mean accuracy of the baseline GCN (RDKit) model is quite similar to the public ECFP recommender model. If anything, the top-5, top-10, and top-50 accuracy of the GCN (RDKit) model slightly outperform the ECFP recommender model with approximately 1% point. However, as stated in the introduction, ML models generally perform worse for less frequent data. Due to the very imbalanced nature of the data, the mean accuracy across the entire test hides details in the performance. As Illustrated in Figure 3, which shows the mean accuracy for the cumulative template occurrence, there is a distinct drop in accuracy for less frequent reactions. The top-1 accuracy of the templates that occur less than or equal five times is roughly half as accurate as the templates that occur more than 100 times: 34.8% compared to 65.7% for the ECFP model, and 33.5% compared to 65.0% for the and GCN (RDKit). Figure 3 illustrates that the insignificant differences in the mean accuracy hid some noteworthy differences between the two models. It is apparent that the GCN (RDKit) model generally performs better for the rare reactions, and this trend is most evident for the top-50 accuracy where we find a difference of 8.5% between the two models for templates that occur less than or equal to five times. This slight rise in accuracy for the GCN (RDKit) model can likely be attributed to the convolutional layers, which in essence, build up a custom fingerprint that is trained towards recommending which template to use. Next, we investigate what happens if QM properties are included in the node feature vector. Therefore, we extend the GCN (RDKit) atom feature input with six additional GFN2-xTB QM features: the partial charge (q), the Fukui index (f+, f0, f-), the sum of Wiberg covalent bond orders (sBO), and the polarizability (α). We could have calculated many different molecular and atomic properties, which   Table 1) are practically identical with only insignificant changes of maximally 0.1% point. The same trend is observed in the cumulative accuracy in Figure 3 where it is hard to distinguish the two models.
To ensure that result is not just a consequence of the uneven distribution of RDKit and QM atom features. The input node embedding in the GCN (RDKit + QM) contains 32 values in total, and only six of those are QM properties. We train two extra models where the RDKit part of the GCN (RDKit) and GCN (RDKit + QM) models are compressed into a six and 12-dimensional vector, respectively. By compressing the RDKit part of the GCN (RDKit + QM) before concatenating with the six QM properties, we transform the node embedding into a 12-dimensional feature vector with equal weight on the RDKit and QM features. The compression is performed by passing the RDKit input feature vector through a dense linear layer with a ReLU activation function (red layer in Figure 2) before concatenating with the QM descriptors. Again, we observe no significant differences between the two models when comparing the mean accuracies in Table 1 or the cumulative accuracy in Figure 3. If anything, the GCN (12 RDKit) model performs slightly better than the GCM (6 RDKit, 6 QM). These findings refute our hypothesis that the lack of differences is a result of uneven distribution of RDKit and QM features.
To better understand what the QM properties do to the model, we train two additional models. The first model we name GCN (Atom type) and is the most basic model we trained. The GCN (Atom type) model only contains node features that describe the kind of atom (same atom encoding as the GCN (RDKit) model). The second model is called GCN (Atom type + QM) extends the GCN (Atom type) model by including the six QM atom properties. Unsurprisingly, the GCN (Atom type) model performs the worst since it contains much less information about the atoms. The top-1 mean accuracy (Table 1) drops from 58.6% to 48.8% compared to the GCN (RDKit) model. However, when the QM properties are included, the top-1 accuracy increases to 56.5%, which is only a decrease of 2.1% points compared to the GCN (RDKit) model and only a drop of 0.7% points for the top-50 accuracy. This suggests that there is a significant correlation between the RDKit features and the QM property features we chose. Such correlations can explain why we see no differences in accuracy between the GCN (RDKit) and GCN (RDKit + QM) model since they, in essence, carry similar information about each atom. It also explains why the GCN (Atom type + QM) and the GCN (RDKit) model almost perform the same.
Applicability of the predicted templates. The accuracy of the expansion network only reflects the network's ability to match one product to one template. In practice, when the network is applied to retrosynthetic route planning, the top-50 templates (or a cumulative probability of 0.995) are typically applied to the target product molecule [14,1]. However, even a highly accurate model is not guaranteed to recommend applicable templates. This problem gives rise to high failure rates for the proposed templates due to the template's inability to match a substructure in the target for which it is was predicted. Table 2 shows the average number of templates it takes to reach cumulative probability above 0.995, and the number of those that actually can be applied to the target molecule. Here we notice differences between the ECFP recommender model and the GCN (RDKit) model. The most notable is the average number of templates that needed to reach a probability of 0.995 is 42.3 for the ECFP model compared to 31.9 for the GCN (RDKit) model. Which could indicate that the GCN (RDKit) model is slightly more certain about the top-1 prediction compared to the ECFP model, and therefore does not need many extra templates to reach a cumulative probability of 0.995. This is confirmed by comparing the average top-1 probability for the GCN (RDKit) and ECFP models which is 0.69 and 0.63, respectively. Because we include less templates we also see a slight decrease in the number of applicable templates from 16.3 to 12.7. However, the success rate of applicable templates in the GCN (RDKit) and ECFP models is approximately 40% in both cases. By comparing the GCN (RDKit) and GCN (RDKit + QM) we see that the average number of templates needed to reach a cumulative probability of 0.995 and the number of applicable templates is practically unaffected by including our chosen QM properties.
The maximal number of applicable templates we find is 16.3 for the ECFP model. To ensure that it is not just the average number of applicable templates for the target molecule (of the 46,695 unique templates), we apply all templates to all target molecules. The distribution of applicable templates in the test set for each target molecules are seen in the SI Figure S 2, but on average can 281 templates be applied to the target. The network thus still provides a clear enrichment of applicable templates in the top prioritized templates.
The fact that we, on average, can apply 281 of the 46,695 templates raises an essential question about the training setup. For retrosynthetic prediction, we are not in general interested in training a network that only recreates the one-to-one mapping found in the training set (target molecule to a template). Instead, we want the network to propose several applicable templates, preferably a mix of rare and common templates. However, as training is set up for most data-driven retrosynthetic tools available, the network is, during training, attempting to recreate the one-to-one mapping. This is due to the choice of the loss function, i.e. the cross-entropy loss, which only rewards the network if it succeed in an exact one-to-one mapping [1,14]. The discrepancy between the training objective and how we are applying the model means, in the extreme case, that we are relying on coincidences that there is more than one applicable template among the top-50 templates. The training setup will force the network to get as close as possible to predict precisely one template with more certainty. During training the network will attempt to make the top-1 probability as close as possible to 1. Thus, we will likely observe that the more accurate (or specific) the model becomes, the fewer templates the model recommend (given the same probability cut-off). Because the model is more confident in its prediction (top-1 probability is closer to 1), we needed fewer templates to reach a cumulative probability of 0.995. This trend is observed when we compare the average templates of the GCN (RDKit) and ECFP recommender model in Table 2.
The GCN type of networks thus generally offers an advantage over the ECFP network, with a better top-50 accuracy for less frequent reactions. But the inclusion of QM data does not offer any improvements over the GCN (RDKit) model that only uses RDKit properties.
Random oversampling/undersampling. As an alternative to adding information on the atomic level, we also tested oversampling of seldomly used templates. The reason for oversampling/undersampling in machine learning is that classifiers (as the recommender model) are typically more sensitive towards the majority class. This can be illustrated by analyzing the mean template occurrence, which describes the average reported instances of the templates used/predicted of the reactions in question. For example, two reactions rely on templates reported 50 and 100 times, resulting in a mean template occurrence of 75. In the USPTO test set, the mean template occurrence is 1112, while the top-1 mean template occurrence for the GCN (RDKit) model is 1742. This indicates that the trained model generally predicts more frequently used templates than the underlying dataset dictates. To overcome the bias, more weight can be assigned to the minority classes in the training set by sampling them with replacement or reducing the importance of the majority classes in training set by undersampling, which means that you only choose some samples from the majority class. The objective of oversampling/undersampling is to get a more even accuracy across all types of reactions, which would be a more horizontal line for the cumulative template occurrence in Figure 5.
We train the GCN (RDKit) model on three different training sets to test how random under-/oversampling influences the recommender models performance. The three different training sets all have varying levels of oversampling and undersampling. The first training set is created by oversampling. We sample ten reactions (with replacement) for each template class that occur ten or fewer times in the training set (35,355 of the 46,673 templates). The training set contains 1,022,940 reactions. The second training set is sampled identically to the first one, but instead, we use a cut-off of 50 (44,636 of the 46,673) occurrences. With a cut-off of 50, we get a training set with 2,717,960 reactions. The third training set is curated by both under-and oversampling the reactions, so the number of reactions in each template class is equal. We do this by sampling 30 reactions, with replacement, from each template class, resulting in a training set size of 1,400,190 reactions. Figure 5 clearly illustrates that by oversampling the less frequent reactions, we see a significant increase in the accuracy for the less used templates. The top-1 accuracy for reactions that occur less than or equal to five times increases from 33.8% to 50-54%, depending on how the test set sampling is performed. However, the way that the model behaves for frequent reactions is quite different. The model where each template is represented by exactly 30 reactions (grey line) shows a dramatic decrease in accuracy for frequently used templates. Also, the average template occurrence for the model is only 78, which suggests that the model is heavily biased towards less frequent reactions. The strong bias indicates that we have thrown too much information away due to a very aggressive undersampling of common templates. The two other models, that oversample the rare reactions (cyan and yellow lines), both clearly exhibit a more horizontal trend. The horizontal trend indicates that the accuracy are more balanced across all templates than the model trained on the unprocessed training set. Another way to illustrate this is by comparing the average top-1 template occurrence which is 1492 and 1389 for the slightly oversampled (cut-off 10, cyan line) and the "heavily" oversampled (cutoff of 50, yellow line), respectively. These average template occurrences are much better in line with the true values of the test set, albeit there is still room for improvement. The slightly over oversampled model performs worse on less frequently used templates then the "heavily" oversampled model. However, around the a template of 100 occurrence the "heavily" oversampled model falls below the model trained on the standard training set. The drop below the standard model means that even though the accuracy for rare reactions is better the overall accuracy of the model is a bit worse. The applicability of the templates recommended by the slightly and "heavily" oversampled models are pretty similar. As seen in Table 2, both models, on average, needs the top-28 templates to reach a  cumulative probability of 0.995 (or maximally top-50). But among the recommended templates, only 10 and 11 of them are applicable. Figure 4 show the average cumulative probability of the models. It illustrates that the increase in cumulative probability is more significant when fewer templates are included for the oversampled graph methods. It confirms the tendency that increased model confidence will result in fewer templates needed to reach the cumulative probability of 0.995 (or maximally 50 templates). Consequently, it will recommend fewer applicable templates. Compared to the ECFP recommender model, the two oversampled models will find 5 and 6 fewer applicable templates from the USPTO test set.

Conclusions
We investigated how a selection of semi-empirical QM properties would influence the accuracy of the expansion policy networks used for retrosynthetic planning. We used graph representations to train graph neural networks to incorporate the atomic QM properties into a meaningful molecular descriptor. We trained the neural graph networks with and without QM properties and found that the selected semi-empirical QM properties did not affect the accuracy or template applicability. It is important to highlight that we only tested a few QM properties and did not perform an exhaustive test of QM properties. Other properties could potentially yield a different conclusion as illustrated by Stuyver et al. [11]. By changing from the ECFP representation to a graph representation, the top-5, top-10, and top-50 accuracies increased by 5.5-8.5% points. When switching to a graph representation, we observed that the number of templates needed to reach a cumulative probability of 0.995 decreased from 42.2 to 31.3. This decrease resulted in a slight reduction in the number of applicable templates (16.3 to 12.7), which made us question the contradictory way we train and apply the expansion policy network. When the expansion policy network is trained, we target a one-to-one mapping between the target molecule and template documented in the training set. However, when the network is applied for retrosynthesis prediction, we expect the network to return multiple applicable templates for the target molecule. We also tested if oversampling/undersampling can increase the accuracy of rare templates without decreasing the accuracy of the frequently used ones. These relatively simple techniques result in a substantial top-1 accuracy increase of approximately 20% for templates represented five times or less in the USPTO reaction dataset. We tested three different ways of oversampling/undersampling and found that not all methods perform equally well. If the templates are sampled evenly, you see a severe decrease in accuracy for templates used more than 100 times. But if you only oversample templates observed 50 times or less, we do not observe the same severe decrease in accuracy for frequently used templates.