Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP

Predicting the nature and outcome of reactions using computational methods is a crucial tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as an input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed to the left and right of the reaction arrow, respectively, without the need for distinguishing between reactants and reagents. We show that DRFP outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, reaching the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.


Introduction
Computational methods to predict the nature and outcome of reactions are important tools to accelerate chemical research. [1-11] The nature of a reaction is well-described by its name and class, where a reaction class is defined by the general reaction type and the participating chemical entities. [12-14] Automating the classification of reactions provides a tool for chemists to search databases and to quickly evaluate and optimise a novel reaction based on the nature of similar reactions. An important outcome of a chemical reaction is its yield, the percentage of reactants successfully converted into the desired product. Computational methods for predicting such yields are highly valuable in synthesis planning, where high yields are of paramount importance, especially in multi-step reactions. Earlier work used physics-based descriptors or structure-based molecular fingerprints to classify chemical reactions or predict reaction yields. [6,15,16] However, computational complexity and inherent biases have introduced seemingly insurmountable challenges to these approaches. Recently, with the availability of large data sets and the resurgence of artificial neural networks (ANN), deep learning-based learned fingerprints have been introduced as an alternative to earlier methods, outperforming them by considerable margins. [11] However, these approaches come with several drawbacks as well. Training a learned fingerprint requires large amounts of data of acceptable quality, and the fingerprint must be retrained when new data becomes available, posing a challenge to accessibility and reproducibility. Due to the nature of ANNs, learned fingerprints are challenging to interpret, as they, for example, require a careful analysis of attention weights. [11] Finally, the training and evaluation of the models require specialised hardware and software to become computationally tractable.
Here we present the differential reaction fingerprint (DRFP) for reaction search and categorization as well as yield prediction. DRFP borrows the creation of circular substructures from a molecule and the subsequent hashing of their SMILES representations from the chemical fingerprints ECFP and MHFP, respectively (see Figure 1 and Molecular n-grams). [18,19] However, as reaction SMILES consist of multiple molecules in the form REACTANTS>AGENTS>PRODUCTS, three additional steps have to be introduced: (I) The agents are added to the reactants, resulting in the representation REACTANTS+AGENTS>>PRODUCTS; (II) molecules on each side of the reaction representation are processed individually, resulting in two sets of SMILES R and P; (III) the symmetric difference of the two sets S = R △ P is taken, its elements are hashed using an arbitrary hash function with a sufficiently low collision probability (BLAKE2), and the hash values are then folded into a fixed-length binary vector using h(k) = k mod d, where k is the hash value of an element of S and d is the desired dimensionality of the fingerprint. Compared to the approach introduced by Schneider et al. [16], DRFP does not apply weights based on atom-mapping to differentiate between reactants and agents, does not require the calculation of molecular properties for the agents, and does not apply arithmetic operations on individual molecular fingerprints, such as the atom pair fingerprint, to create a reaction fingerprint.
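Steps (I) to (III) can be sketched in a few lines of Python. This is an illustrative sketch and not the implementation in the drfp package: the function names (split_reaction, hash_ngram, fold) are hypothetical, the 4-byte BLAKE2b digest is an assumption, and the n-gram sets passed to fold are hand-written placeholders standing in for the SMILES of circular substructures.

```python
import hashlib


def split_reaction(rxn_smiles):
    """Steps (I) and (II): merge agents into reactants, then split each side into molecules."""
    reactants, agents, products = rxn_smiles.split(">")
    left = [s for s in f"{reactants}.{agents}".split(".") if s]
    right = [s for s in products.split(".") if s]
    return left, right


def hash_ngram(ngram):
    """Hash one n-gram SMILES to an integer (assumed here: BLAKE2b, 4-byte digest)."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "little")


def fold(ngrams_left, ngrams_right, d=2048):
    """Step (III): hash the symmetric difference and fold it with h(k) = k mod d."""
    fingerprint = [0] * d
    for ngram in set(ngrams_left) ^ set(ngrams_right):
        fingerprint[hash_ngram(ngram) % d] = 1
    return fingerprint


left, right = split_reaction("CCO.CC(=O)O>[H+]>CC(=O)OCC.O")
# Placeholder n-gram sets; in DRFP these come from the circular substructures.
fp = fold({"CC", "CO", "O"}, {"CC", "C=O", "O"})
```

Because n-grams shared by both sides cancel in the symmetric difference, only the substructures that change during the reaction set bits in the fingerprint; two identical sides would yield an all-zero vector.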
Given this conceptually simple fingerprint, we show that its performance, when applied to the tasks mentioned above, rivals or even surpasses that of state-of-the-art methods while using minimal computational resources and no specialised hardware or software (see Computational Resources). The fingerprint requires an unannotated, non-atom-mapped reaction SMILES as input and embeds this molecular representation from reaction SMILES space into an arbitrary low-dimensional binary metric space through set operations and subsequent mod hashing. We show that a k-NN classifier trained with DRFP significantly outperforms those trained on existing, non-learned fingerprints and rivals or surpasses the performance of learned fingerprints without the need for supervised learning pre-classification. Furthermore, the fingerprint can act as an unbiased benchmark for new methods. Finally, we show that this method, based on a simple set operation and hashing scheme, can outperform both deep learning-based learned fingerprints and physics-based descriptors in yield prediction tasks. We make the fingerprint creation algorithm available as a PyPI package (drfp). The source code and documentation are available on GitHub (https://github.com/reymond-group/drfp).
Figure 1: Encoding a reaction [17] without distinguishing between reactants and agents into a DRFP fingerprint is achieved by first extracting circular substructures of radius r (r = 3 in the above example) into two sets (blue and red circles for reactants and products, respectively). In a second step, the two sets' symmetric difference (blue and red shaded areas) is hashed using an arbitrary hash function. Finally, the resulting set is hashed into a binary vector using modular hashing.

Reaction Classification
The reaction classification was carried out using a k-nearest neighbour classifier (see k-Nearest Neighbours Classifier).
Reducing the training set to 10% and 1% of its original size, aside from causing a general reduction in accuracy, also leads to a better relative performance of the r = 2 variant across all dimensions d (Figure 2b,c). These results suggest that choosing the r = 2 variant might be advantageous in low data settings, and that there is no value in choosing r = 4 over r = 2 or r = 3, independent of d and the amount of available training data. However, as the r = 3 variant performed best in the case of the complete training set for high d, the r = 3 and d = 2048 variant is chosen for all further benchmarks, including reaction yield predictions.
This result suggests that conceptual complexity, including learning, can be factored out of fingerprint creation, moving it instead to the classification task with a minor impact on classification performance. A non-learned fingerprint has the advantages of reducing bias and increasing the interpretability of results as each feature can be mapped to one or more molecular substructures.

Reaction Yield Prediction
Comparing the yield prediction performance of DRFP to that of learned and physical descriptor-based fingerprints shows that this simple fingerprint is competitive, as it demonstrates consistent performance on all test sets. DRFP was compared to a DFT-based fingerprint with random forests, Yield-BERT, and an augmented variant of the latter (Table 2). The data set used is a collection of 3,955 Pd-catalysed Buchwald-Hartwig C-N cross-coupling reactions from a high-throughput experiment by Ahneman et al. [6]. For this data set, 11 splits were defined: seven random splits in which the relative size of the training set was decreased from 70% to 2.5%, and four out-of-sample splits based on isoxazole additives. DRFP performs better on the random splits than the DFT-based fingerprint with random forests and Yield-BERT but is outperformed by the augmented Yield-BERT by a narrow margin. In the out-of-sample splits, DRFP performs better than the augmented version of Yield-BERT and the DFT-based method, yet the non-augmented Yield-BERT performs slightly better. When averaging over all 11 tests shown in Table 2, DRFP performs best.

Conclusion
We have introduced a reaction fingerprint encoding scheme, DRFP, based on a simple four-step process comprising the extraction of circular n-grams, XORing, hashing, and folding. DRFP is capable of reaching state-of-the-art performance without extending the use of machine learning models from classification or regression tasks to the fingerprint creation task. The fingerprint creation algorithm is available as a PyPI package (drfp). Source code and documentation are available on GitHub (https://github.com/reymond-group/drfp).

Computational Resources
We ran all of the training runs as well as the evaluations of the models on a DELL XPS Laptop with 16 GB of main memory, no dedicated GPU, and an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz CPU.

Molecular n-grams
Molecular n-grams are generated from SMILES using the RDKit library. Given a radius r, we iterate over the heavy atoms in an input molecule and extract sub-SMILES centred on each atom with radii 0 to r, where a radius of 0 is the single central atom. In addition, rings from the SSSR (smallest set of smallest rings) are extracted as well. Compared to the atom pair-based approach by Schneider et al. [16], the n-gram-based fingerprint also captures stereochemistry.
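The enumeration of atom-centred environments can be illustrated without a cheminformatics library. The following is a toy sketch under stated assumptions: the molecule is given as a hypothetical adjacency list of heavy atoms, rings are supplied explicitly instead of being perceived, and each n-gram is represented by its set of atom indices rather than by the sub-SMILES that RDKit would produce.

```python
from collections import deque


def atoms_within(adjacency, root, radius):
    """Return the atoms at most `radius` bonds away from `root` (breadth-first search)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        atom = queue.popleft()
        if depth[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency[atom]:
            if neighbour not in depth:
                depth[neighbour] = depth[atom] + 1
                queue.append(neighbour)
    return frozenset(depth)


def shingling(adjacency, rings, r):
    """Collect environments of radius 0..r around every atom, plus the given rings."""
    result = set()
    for atom in adjacency:
        for radius in range(r + 1):
            result.add(atoms_within(adjacency, atom, radius))
    for ring in rings:
        result.add(frozenset(ring))
    return result


# Heavy-atom graph of ethanol: C0 - C1 - O2 (no rings).
ethanol = {0: [1], 1: [0, 2], 2: [1]}
environments = shingling(ethanol, rings=[], r=1)
```

For ethanol with r = 1 this gives three single-atom environments and three radius-1 environments; in the actual fingerprint each environment would be written out as a SMILES string before hashing.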

Gradient Boosting
For regression by gradient boosting, we used the Python library xgboost. Hyperparameter tuning was carried out on the rand 70/30 set of the Buchwald-Hartwig reaction data set.

Algorithm: extraction of molecular n-grams (shingling) from a molecule
  for atom in heavy atoms of molecule do
    for radius in 0 to r do
      Add substructure with radius rooted at atom to shingling as SMILES
    end for
  end for
  for ring in sssr(molecule) do
    Add substructure of ring to shingling as SMILES
  end for
We applied the same hyperparameter values (n_estimators=999999, learning_rate=0.01, max_depth=15, min_child_weight=8, colsample_bytree=0.2125, subsample=1) in all uses of xgboost. For each test, 10% of the training data were randomly selected as the validation set and removed from the training set. The validation data sets were used as evaluation sets for early stopping (20 rounds for all data sets with the exception of the USPTO data, for which 10 rounds were used to speed up the calculation).
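The validation hold-out and early-stopping rule used for gradient boosting can be sketched as follows. This is a library-free illustration, not the code used for the reported results: holdout_split and best_round are hypothetical helper names, the random seed is arbitrary, and the stopping rule is a simplified reading of "stop after a given number of rounds without improvement on the validation set".

```python
import random

# Hyperparameter values quoted in the text. Assumption: they could be passed
# to an xgboost regressor, for example xgboost.XGBRegressor(**PARAMS).
PARAMS = {
    "n_estimators": 999999,
    "learning_rate": 0.01,
    "max_depth": 15,
    "min_child_weight": 8,
    "colsample_bytree": 0.2125,
    "subsample": 1,
}


def holdout_split(n_samples, fraction=0.1, seed=0):
    """Randomly move `fraction` of the training indices into a validation set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * fraction)
    return indices[n_val:], indices[:n_val]


def best_round(val_losses, patience=20):
    """Return the best boosting round, stopping after `patience` rounds without improvement."""
    best, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best, best_loss = i, loss
        elif i - best >= patience:
            break
    return best


train_idx, val_idx = holdout_split(100)
stop_at = best_round([5.0, 4.0, 3.0, 3.5, 3.6, 3.7], patience=2)
```

With a very large n_estimators, the number of trees is in effect determined by early stopping rather than by the nominal limit.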

k-Nearest Neighbours Classifier
The k-Nearest Neighbour classifier was implemented according to Schwaller et al. [11] using faiss with k = 5.
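As an illustration of the classification step, a brute-force 5-NN majority vote on binary fingerprints can be written in plain Python. This sketch does not use faiss and makes two assumptions that are not taken from the text: neighbours are ranked by Hamming distance (which equals the squared Euclidean distance for binary vectors), and the class is decided by a simple majority vote. The fingerprints and labels are made-up toy data.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two binary fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_fps, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    ranked = sorted(range(len(train_fps)), key=lambda i: hamming(train_fps[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 4-bit fingerprints for two made-up reaction classes.
train_fps = [
    [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1],
]
train_labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_predict(train_fps, train_labels, [1, 1, 0, 0])
```

A brute-force search like this scales linearly with the size of the training set; an index library such as faiss serves the same purpose for large data sets.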

Multilayer Perceptron Classifier
In addition to the DRFP + 5-NN classifier, a DRFP + multilayer perceptron (MLP) classifier was applied to the USPTO 1k TPL data set. The MLP was implemented using TensorFlow 2.4.1 and consists of an input layer the size of the input vector (2,048), a dense hidden layer of size 1,664 with a tanh activation function, and a dense output layer with a softmax activation function. The loss function was sparse categorical cross-entropy. Adam was used as an optimiser. The model was trained over 10 epochs with a batch size of 64 on a CPU.
For the evaluation of AP3 256, the number of units in the hidden layer was changed to 1024, and the model was trained for 100 epochs.