A Hybrid Alchemical Free Energy and Machine Learning Methodology for the Calculation of Absolute Hydration Free Energies of Small Molecules

Abstract

A methodology that combines alchemical free energy calculations (FEP) with machine learning (ML) has been developed to compute accurate absolute hydration free energies. The hybrid FEP/ML methodology was trained on a subset of the FreeSolv database, and retrospectively shown to outperform most submissions from the SAMPL4 competition. Compared to pure machine-learning approaches, FEP/ML yields more precise estimates of free energies of hydration, and requires a fraction of the training set size to outperform standalone FEP calculations. The ML-derived correction terms are further shown to be transferable to a range of related FEP simulation protocols. The approach may be used to inexpensively improve the accuracy of FEP calculations, and to flag molecules which will benefit the most from bespoke forcefield parameterisation efforts.


Introduction
Alchemical free energy calculations (also known as free energy perturbation, FEP) are increasingly used in academia and industry to support ligand optimisation problems in the early stages of drug discovery. 1-4 The domain of applicability of current alchemical methodologies has to date mainly been restricted to hit-to-lead and lead optimisation scenarios, owing to limitations in computing cost, conformational sampling, and the accuracy of the potential energy functions used to compute protein-ligand energetics. 5-10 There is continued interest in the development of more accurate potential energy functions to benchmark FEP workflows on diverse, well-curated protein-ligand datasets, 11-14 and for applications to blinded challenges or methodological studies. [15][16][17][18][19][20] The calculation of hydration free energies has historically been an important stepping stone towards more accurate forcefields for protein-ligand binding free energy calculations. [21][22][23] Blinded competitions such as SAMPL have also focused on hydration free energy calculations. 24 Forcefield parameterisation is a painstaking challenge that requires meticulous and laborious efforts to yield steady gains in accuracy. Recent parameterisation efforts from the Open Force Field, AMBER, and CHARMM communities have involved multiple groups. Recent work has sought to simplify the parameterisation process by direct chemical perception of hierarchical parameter types. 25 Nevertheless, it can be difficult to identify what modifications to introduce to improve the accuracy of parameter sets. Ultimately, fundamental limits in accuracy cannot be overcome due to an incomplete description of the physics of the process, for instance due to the use of fixed-charge forcefields that neglect polarisation effects.
26 Notably, this realisation has prompted the development of post-processing methodologies based on quantum mechanical (QM) calculations to introduce correction terms for hydration and binding free energies computed by FEP methods using a classical force field. [27][28][29][30][31] Data-driven machine-learning (ML) methods have witnessed a resurgence of interest in drug discovery in recent years. Impressive advances have been made in the areas of machine learning of quantum chemical calculations, 32,33 virtual screening, 34,35 and free energies of hydration. [36][37][38] Efforts such as DeepChem 39 and MoleculeNet 40 have popularised the use of ML methods for molecular property predictions. Recent efforts have made use of 3D convolutional neural networks or other graph convolutional neural networks to predict binding affinities from the spatial structure of protein-ligand systems. 41,42 While impressive results have been demonstrated, the performance of ML methods is limited by the requirement for often substantial training sets, and by a rapid decrease in accuracy when the models are applied to molecules dissimilar to those included in the training set.
In previous work undertaken by our group as part of the SAMPL6 competition, 20 we observed that empirically correcting FEP-derived host-guest binding free energies with a linear regression model calibrated on preceding SAMPL5 submissions 43 led to a significant decrease in the mean unsigned error (MUE) of the predicted binding affinities. The present study extends this approach with machine-learning regression models that act as empirical correction terms to the FEP results. That is, the ML models are trained to predict the error, relative to experiment, in the Gibbs free energies produced by alchemical calculations, referred to from here on as ∆G_offset.
For any given alchemical prediction ∆G_FEP and associated experimental free energy ∆G_EXP, ∆G_offset is defined as the difference between the two; it also constitutes the training label for the given perturbation. This method relies on the assumption that, given a training set of sufficient size, an empirical model trained on this set will be able to accurately estimate ∆G_offset values for a new set of alchemical predictions, thereby compensating for systematic errors in the underlying alchemical methodology.
As a proof-of-concept, we explore absolute alchemical calculations of hydration free energies performed with GROMACS. 44 Our results show that the proposed hybrid FEP/ML methodology leads to significant improvements in the accuracy of calculated hydration free energies, whilst only requiring modest training sets compared to a pure machine learning approach.

Theory & methods FEP/ML model generation
The present methodology describes a regression model that fits the mistake that an alchemical calculation makes for a given molecule A, where the mistake is defined by equation 1:

∆G_offset(A) = ∆G_EXP(A) − ∆G_FEP(A)    (1)

where ∆G_FEP(A) is the hydration free energy of molecule A calculated by the alchemical method, and ∆G_EXP(A) is the experimentally determined hydration free energy for the same molecule. For a given training set with defined descriptors, machine-learning models were used to fit the training domain using five-fold cross-validation over 10 replicates, resulting in a total population N_pop of 50 trained models (see methods section below). All individual models in N_pop are regression models predicting their own ∆Ĝ_offset value. We define our offset estimator as the arithmetic mean of these offset values, and use the standard deviation of the mean as a measure of the precision of the calculated offset. Thus we define a corrected hydration free energy as:

∆G_FEP/ML(A) = ∆G_FEP(A) + ⟨∆Ĝ_offset(A)⟩    (2)
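The offset definition (equation 1) and the corrected free energy estimator can be sketched as a small helper. This is a minimal illustration, not the authors' code; it assumes the sign convention ∆G_offset = ∆G_EXP − ∆G_FEP, so that the mean predicted offset is added to the FEP estimate, and the function name is hypothetical:

```python
import statistics

def corrected_hydration_free_energy(dG_fep, offset_predictions):
    """Average the per-model offset predictions (one per trained model
    in N_pop) and add the mean offset to the alchemical estimate.

    Returns (dG_fep_ml, sem), where sem is the standard deviation of
    the mean offset, used as the precision of the correction."""
    n = len(offset_predictions)
    mean_offset = statistics.fmean(offset_predictions)
    # standard deviation of the mean (standard error) across models
    sem = statistics.stdev(offset_predictions) / n ** 0.5
    return dG_fep + mean_offset, sem

# e.g. a hypothetical FEP estimate of -3.2 kcal/mol with three model offsets
dG, sem = corrected_hydration_free_energy(-3.2, [0.5, 0.7, 0.6])
```

In the paper's protocol the offset list would hold the N_pop = 50 model predictions rather than three.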

Feature generation & pre-processing
Features (descriptors) were generated for all compounds present in FreeSolv. The ML models in this study were generated using RDKit 2019.03.4.0. 52 Molecules were loaded using the provided SDF files, and featurized using the following classes on standard settings unless indicated otherwise:

• APFP: Atom-pair fingerprints were generated using rdkit.Chem.rdMolDescriptors.GetHashedAtomPairFingerprint(); the length was set to 256.
• MolProps: Molecular properties were generated using the Mordred python API 53 with inclusion of 3D properties. Although the total number of descriptors that this API generates is 1825, non-numeric columns were excluded resulting in 1113 properties that constitute the features per compound.
Additionally, each fingerprint was appended individually to the MolProps features (resulting, for instance, in a feature set called 'MolPropsAPFP', obtained by appending 'APFP' to 'MolProps'), giving feature vectors whose length is the sum of both feature sets (in the case of MolPropsAPFP, 1113 + 256 = 1369). Every feature set was subsequently Z-normalized to zero mean, and sklearn.decomposition.PCA was used to reduce dimensionality by principal component analysis, retaining the principal components contributing up to 95% of the variance.
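The pre-processing step above can be sketched as follows. This is a plain-NumPy stand-in for the sklearn StandardScaler/PCA combination described in the text (the paper used sklearn.decomposition.PCA); the function name and random demonstration data are illustrative:

```python
import numpy as np

def preprocess(features, variance_cutoff=0.95):
    """Z-normalise an (n_samples, n_features) matrix to zero mean and
    unit variance, then project onto the leading principal components
    that together explain up to `variance_cutoff` of the variance."""
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    X = (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)
    # PCA via SVD of the standardised (hence centred) matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    n_keep = int(np.searchsorted(np.cumsum(explained), variance_cutoff)) + 1
    return X @ Vt[:n_keep].T, n_keep

# demonstration on random data standing in for a feature matrix
rng = np.random.default_rng(0)
X_reduced, n_components = preprocess(rng.normal(size=(100, 20)))
```

In the paper's pipeline the input matrix would be, e.g., the 1369-column MolPropsAPFP feature set rather than random data.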
After data pre-processing, the corresponding label (∆G offset , see Eq. 1) was appended to each data point in order to build the final training set (named 'FEP/ML'). Additionally, a second training set (named 'ML') was generated by using as labels (output variables) the experimentally-determined ∆G exp value for each data point.
A 5-fold cross-validation approach was chosen to reduce the risks of overfitting the training set. The training set was thus randomly split into five equally-sized folds (of size 595/5 = 119). Training was repeated five times, rotating the folds so that each fold acted as the validation set once for the other four training set folds. Additionally, training was performed with 10 replicates per feature set, resulting in a total of 50 trained models per feature set/ML model combination. Training validation errors indicated underfitting for the multiple linear regression (MLR) and random forest (RF) protocols relative to the support-vector machine (SVM) and deep neural network (DNN) protocols (Figure S1). For MLR this is to be expected because of the relative simplicity of the model. Although the RF algorithm is more complex, it is primarily designed for classification problems rather than regression problems due to its dependence on decision trees, which may explain its underfitting. The algorithm is included as a control in the current study.
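The fold construction described above can be sketched as follows; the function and variable names are illustrative, not taken from the original code:

```python
import random

def cv_folds(n_samples, n_folds=5, n_replicates=10, seed=42):
    """Generate (train, validation) index splits: for each replicate
    the sample indices are reshuffled and cut into n_folds equal
    folds; each fold serves once as the validation set, giving
    n_folds * n_replicates trained models in total."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_replicates):
        order = list(range(n_samples))
        rng.shuffle(order)
        fold_size = n_samples // n_folds
        folds = [order[i * fold_size:(i + 1) * fold_size]
                 for i in range(n_folds)]
        for k in range(n_folds):
            valid = folds[k]
            train = [i for f, fold in enumerate(folds)
                     if f != k for i in fold]
            splits.append((train, valid))
    return splits

# 595 training compounds -> 5 folds of 119, 50 splits in total
splits = cv_folds(595)
```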

Machine-learning models
A range of different feature sets was used to identify efficient encodings for describing ∆G_offset. A general trend in feature set performance can be observed across ML models. MolProps and combinatorial feature sets (fingerprints appended to MolProps) fit the training set better than standalone fingerprints (APFP, TOPOL and ECFP6), and X-NOISE performs worst, as expected, since this feature set is generated from random data.
Because standalone MolProps generally outperform standalone fingerprints, it is likely that the combined feature sets benefit mainly from the more predictive MolProps component. The observation that MolProps appears to outperform other feature sets suggests that some of the descriptors included in MolProps correlate well with free energies of hydration. This is reinforced by our observation that the MolProps feature set generally outperforms other feature sets when predicting ∆G of hydration directly in our pure ML models (see figure S1).
Although the extended-connectivity fingerprint (ECFP) 56 is used extensively in QSAR regression problems, our training protocol suggests underfitting of the training set for this feature type. This is likely because the diameter of six bonds used is too large to accurately discriminate between the relatively small compounds in the FreeSolv database (see figure S5); testing with smaller diameters suggests an increase in fitting ability, although these models still underperform with respect to other feature types (see figure S3).

Hybrid FEP/ML models outperform standalone FEP and ML models in SAMPL4
The trained models were used to predict on the FreeSolv-SAMPL4 test set. Because low errors in training validation do not necessarily translate into low errors in testing validation, all trained models were tested (see figure S2 and table S2). Top-performing models per ML model type (see figure 1) were based primarily on the MolProps feature set for SVM, RF and MLR, but not for DNN. It is likely that the latter suffers from a degree of overfitting, causing individual models to differ widely in predicted offset values. This is apparent in the much larger uncertainties in dataset metrics for DNN. Nevertheless, the accuracy of the predictions obtained by averaging over the 50 DNN models is competitive. Overall, SVM appeared to give more consistently accurate and precise estimates of ∆G_offset values.
One compound in the test set (mobley_4587267, (2R,3R,4R,5R)-hexane-1,2,3,4,5,6-hexol, referred to as mannitol from hereon) stands out with a free energy of hydration significantly more negative than that of other compounds in the test set (∼ −24 kcal·mol−1).

The top-performing ML model (SVM; MolProps, Figure 2C) achieves accuracy similar to FEP, but with larger uncertainties. This trend worsens for the top-performing DNN model (Figure 2D). As noted before, mannitol contributes substantially to model performance: ML performed broadly similarly to FEP, but the uncertainty of the metrics is again remarkably large. This indicates that there is significant variability in the free energies of hydration predicted for the same compound across the ensemble of ML models. By contrast, the FEP/ML predictions are of similar precision to the FEP predictions, as the uncertainties in the offset terms are comparable to or smaller than the uncertainties in the alchemical estimates.

Influence of training set size on accuracy of correction terms
We also evaluated the impact of training set size on the accuracy of the correction terms.

The offsets are transferable to a number of related SAMPL4 submissions

The transferability of the ML-derived offsets to related simulation protocols was also assessed to evaluate the general applicability of the methodology. Figure 5 summarises changes in metric ranks for all complete submissions that featured an FEP methodology (n=19). Overall, the offsets improved/maintained/worsened the rankings of 12/5/2 submissions for Pearson r; 10/3/6 submissions for MUE and RMSE; and 9/6/4 submissions for Kendall tau. Importantly, with one exception (see below), the offsets do not worsen the ranks of the top-performing submissions.
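The improved/maintained/worsened tallies can be reproduced from per-submission metric values with a small helper. This is an illustrative sketch only; the submission ids and metric values below are made up for demonstration:

```python
def rank_changes(metric_before, metric_after, higher_is_better=True):
    """Compare each submission's rank on a metric before and after
    applying the ML offsets, and tally how many submissions improved,
    maintained, or worsened their rank. The inputs map
    submission id -> metric value."""
    def ranks(metric):
        ordered = sorted(metric, key=metric.get, reverse=higher_is_better)
        return {sub: pos for pos, sub in enumerate(ordered, start=1)}
    before, after = ranks(metric_before), ranks(metric_after)
    improved = sum(1 for s in before if after[s] < before[s])
    maintained = sum(1 for s in before if after[s] == before[s])
    worsened = sum(1 for s in before if after[s] > before[s])
    return improved, maintained, worsened

# e.g. Pearson r for three hypothetical submissions, before/after offsets
res = rank_changes({"004": 0.70, "137": 0.80, "168": 0.75},
                   {"004": 0.85, "137": 0.82, "168": 0.74})
```

For an error metric such as MUE or RMSE, `higher_is_better=False` would be passed instead.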
As expected, SAMPL4 submission 004 is among the entries that benefit the most from the correction terms. Several entries that used a similar forcefield (GAFF with AM1-BCC charges) but a different simulation engine or different free energy estimation protocols (e.g. 137, 168, 544, 575) also show improvements in metrics. This is reasonable, as it has been shown that, when properly implemented, hydration free energies computed with the same forcefield by different simulation engines will broadly agree to within

Conclusions
This work has demonstrated that it is possible to combine 'physics-driven' FEP methods with 'data-driven' machine-learning methods to predict absolute hydration free energies of small molecules. The chief advantage over FEP is that improvements in the accuracy of the predictions are achieved without having to embark on cumbersome forcefield parameterisation efforts. Compared with a pure ML approach, FEP/ML requires a much smaller training set to outperform standalone FEP. This is significant, as it indicates that for a new dataset it is possible to make predictions without any available experimental data initially, and to switch to an FEP/ML approach once a number of data points have been experimentally determined.

References
Table S3: Key statistics for machine-learning models combined with FEP (FEP/ML entries) and pure machine-learning models predicting ∆G directly (ML entries), sorted by MUE in ascending order. Uncertainties per statistic are given as ± entries. For these results, the mannitol outlier (mobley_4587267) has been removed.

Figure S3: Hyperparameter optimisation of support-vector machine models fitting on ∆G_offset for different ECFP diameter sizes computed for compounds in the FreeSolv database. Depicted are the number of hyperparameter calls versus cumulative minima of training-validation mean unsigned error in kcal·mol−1 for extended-connectivity fingerprint diameters 2, 4, 6 and 8. Error regions are computed as the standard deviation across ten replicates. The black dashed line indicates the converged validation error for the top-performing model in the main text body (SVM-MolPropsAPFP).

Figure S8: Changes in ranks of SAMPL4 submissions after application of offsets to predicted hydration free energies. Depicted are the top 20 SAMPL4 non-FEP entries before (blue) and after (orange) hybridisation with the SVM-MolPropsAPFP correction term. Entries were sorted by total ranks gained in ascending order.