Improving ∆∆G predictions with a multi-task convolutional Siamese network

The lead optimization phase of drug discovery refines an initial hit molecule for desired properties, especially potency. Synthesis and experimental testing of the small perturbations during this refinement can be quite costly and time-consuming. Relative binding free energy (RBFE, also referred to as ∆∆G) methods allow the estimation of binding free energy changes after small changes to a ligand scaffold. Here we propose and evaluate a Convolutional Neural Network (CNN) Siamese network for the prediction of RBFE between two bound ligands. We show that our multi-task loss is able to improve on a previous state-of-the-art Siamese network for RBFE prediction via increased regularization of the latent space. The Siamese network architecture is well suited to the prediction of RBFE in comparison to a standard CNN trained on the same data (Pearson's R of 0.553 and 0.5, respectively). When evaluated on a left-out protein family, our CNN Siamese network shows variability in its RBFE predictive performance depending on the protein family being evaluated (Pearson's R ranging from −0.44 to 0.97). RBFE prediction performance can be improved during generalization by injecting only a few examples (few-shot learning) from the evaluation dataset during model training.


Introduction
Lead optimization is an early phase of the drug discovery process that simultaneously optimizes a hit molecule for potency, solubility, and other toxicological and pharmaceutical properties. Small modifications are made to the chemical scaffold of the hit molecule and tested for their effect on the properties of interest. A collection of such molecules, along with the initial hit molecule, is termed a congeneric series. Congeneric series developed within a drug discovery campaign can contain tens to hundreds of compounds, 1,2 with the synthesis and testing of each chemical modification taking considerable amounts of time and money. Relative binding free energy (RBFE, also called ∆∆G) methods provide an in silico alternative to the labor-intensive synthesis and experimental testing of each compound in a congeneric series.
RBFE methods strike a balance between accuracy and throughput. Typical methods for RBFE determination utilize either molecular dynamics with alchemical perturbations or thorough sampling of the endpoints of the transformation. Alchemical methods, also called pathway methods, perturb the bound molecule from one ligand into another using chemical or alchemical means; 3 Free Energy Perturbation 4 (FEP) is one of the most popular alchemical methods. FEP utilizes explicitly solvated molecular dynamics or Monte Carlo simulations in which one ligand is alchemically transformed into another ligand. Recent advances in molecular mechanics force fields, sampling, and reductions in computational cost have encouraged the adoption of the FEP approach for the prediction of RBFE in both academia and industry. These advances have allowed FEP approaches to reach high accuracy, within about 1 kcal/mol. However, current implementations only allow for the calculation of about four ligand perturbations per day with commonly available computing resources. 5 FEP is also limited in the size of the change between ligands, to about 10 heavy atoms, due to the large amount of sampling required for each change to a ligand. However, with careful consideration, more heavy atoms can be changed between the ligands while still achieving low prediction errors. 6,7 Endpoint sampling methods reduce the amount of molecular dynamics needed to determine the free energy of the system. 8 The most popular methods for determining RBFE with endpoint sampling are molecular mechanics Poisson-Boltzmann surface area (MMPBSA) and molecular mechanics generalized Born surface area (MMGBSA). MMPBSA and MMGBSA, developed by Kollman et al., 9,10 evaluate the free energy through molecular dynamics simulations of the unbound ligands and the bound complexes. RBFE is computed via a simple difference of the energetics in each of the ligand binding modes. 11 While these methods have reduced computational requirements in comparison to FEP, their free energy predictions are not as rigorous. This limited throughput and low allowance for changes between molecules can prevent medicinal chemists from fully exploring the optimization space of a lead molecule.
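Stated as an equation for reference (a standard definition, not specific to any one method), the RBFE between ligands A and B bound to the same receptor is the difference of their absolute binding free energies:

∆∆G(A→B) = ∆G_bind(B) − ∆G_bind(A)

Pathway methods estimate this quantity directly along a perturbation path, while endpoint methods estimate it by differencing the two endpoint energies.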
A number of scoring functions have been developed to simultaneously provide low error and high throughput for absolute binding affinity predictions. [12][13][14][15][16][17] These are able to replace the more thorough and compute-intensive simulation-based methods for estimating absolute binding affinity. More recently, these scoring functions utilize deep learning to infer absolute binding affinity directly from the bound protein-ligand complex. [13][14][15][16] Using these deep learning absolute binding affinity methods as inspiration, Jiménez-Luna et al. 18 utilize a Siamese Convolutional Neural Network (CNN) architecture to directly determine the RBFE between two bound protein-ligand complexes. This architecture removes the compounding error incurred by determining the RBFE from the difference of the absolute binding free energies of the two ligands. They showed the potential of their trained neural network in retrospective lead optimization campaigns with only a small amount of retraining required. Here we present further expansion on their Siamese network by introducing novel loss function components. We evaluate the impact of our loss components as well as the Siamese architecture on the predictive performance of our models. Generalizability of the RBFE predictions is examined through both external datasets and clustered protein family cross validation.

Methods
We describe the filtering and usage of the training and evaluation datasets for our RBFE models. The architecture and training hyperparameters of our CNN Siamese network are explained. Our CNN Siamese network is compared to state-of-the-art methods for RBFE prediction on a retrospective lead optimization task and several benchmark datasets. Next, we investigate the relative importance of the components of our model on a small retrospective lead optimization task. Finally, we evaluate our model on novel protein families utilizing a leave-one-family-out cross validation to elucidate the generalizability of our model.

Data
Proper training of a deep learning model for RBFE in a lead optimization setting requires that we utilize congeneric series with experimentally validated binding affinity measurements.
We therefore utilize the BindingDB 3D Structure Series dataset. 19 This dataset was created by combing the literature for experimental binding affinities of many ligands bound to the same receptor and finding the crystal structure of at least one of those ligands bound to that receptor. The ligands with no known bound structure were computationally docked to the protein using the Surflex docking software 20 for template docking with the crystal ligand. The full dataset encompasses 1038 unique receptor structures with an average of 9.61 ligands bound to each receptor structure. We filter the dataset to ensure the binding affinity measurements are high quality and to enforce comparisons between ligands with identical measures of potency: IC50, Kd, or Ki. First, the dataset is split into three different groups, one for each of the measures of potency. A ligand can be in multiple groups if it has binding affinity measurements for multiple measures of potency. For each split, we strip any greater-than (>) or less-than (<) symbols from the binding affinity measurements of every ligand and use the remaining string as the exact binding affinity value. If a ligand has multiple measurements for a given measure of potency, we delete the ligand from that measure-of-potency split if the range of the measurements is greater than one order of magnitude. Otherwise, we take the median of the multiple measurements. After this filtering, we remove any ligands that have binding affinity information for a PDB ID that has no other ligands with binding affinity measurements. We then construct congeneric series by creating ordered pairs of ligands that have the same receptor structure and the same measure of potency (IC50, Kd, or Ki). We utilize the log-converted measurements (−log10(value)), referred to as "pK", for each measure of potency. We next define a reference ligand. The reference ligand is assigned as the ligand with the highest Tanimoto similarity (using the RDKFingerprint from RDKit 21) to the ligand used for the template docking, usually the ligand in the crystal structure that was used for template docking. Our final filtered BindingDB dataset has 1082 congeneric series, encompassing 943 unique receptor structures with an average of 7.995 ligands per congeneric series. The average pK range of each congeneric series is 2.023. Histograms of the number of ligands and the affinity ranges per congeneric series are shown in Figure S1.
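A minimal sketch of the per-measure filtering rules above, assuming a pandas DataFrame with hypothetical "ligand" and "value" columns holding one measurement per row; the nM input unit is also an assumption for illustration.

```python
import numpy as np
import pandas as pd

def filter_potency_split(df: pd.DataFrame) -> pd.Series:
    """Filter one measure-of-potency split (IC50, Kd, or Ki) to per-ligand pK."""
    # Strip any > or < qualifiers and treat the remainder as the exact value.
    values = (df["value"].astype(str)
                         .str.lstrip("><")
                         .astype(float))
    grouped = values.groupby(df["ligand"])
    # Drop ligands whose repeated measurements span more than one order of
    # magnitude; otherwise keep the median measurement.
    spread = np.log10(grouped.max() / grouped.min())
    medians = grouped.median()[spread <= 1.0]
    # Convert to pK = -log10(value in molar); assumes values were given in nM.
    return -np.log10(medians * 1e-9)
```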
We additionally utilize the external datasets provided by Mobley et al., 22 Wang et al., 7 and Schindler et al. 2 to evaluate the generalizability of our models.

Model Architecture
Similar to Jiménez-Luna et al., 18 we utilize a Siamese network 23 (Figure 1). Siamese networks utilize two arms that share weights and take in two inputs in order to determine a distance between the inputs; they are often utilized in object matching or object tracking. [24][25][26] Our network takes as input the structures of two ligands bound to the same protein, with each arm receiving a different protein-ligand complex. We use CNNs as the arms of our Siamese network to learn directly from the 3D information of the bound protein-ligand structure. The bound protein-ligand 3D structures are voxelized utilizing the libmolgrid python library, 27 using the default channels provided by the library. The inputs are then passed through the main convolutional architectures employed by Gnina, 28 Default2018 or Dense, as defined in Francoeur et al. 13 The Default2018 convolutional architecture uses a series of convolutions and average pooling operations to discern information directly from bound protein-ligand complexes while minimizing computational cost. The Dense convolutional architecture uses a series of densely connected convolutional blocks 29 to enhance the propagation of information at the cost of increased computation. Both of these convolutional architectures have demonstrated an ability to learn absolute binding affinity directly from the 3D bound structure of the protein-ligand complex. We remove the final linear layers from both architectures in order to access the final latent vector of the networks. The difference of the latent vectors of the two protein-ligand complexes is used to learn a linear mapping to the RBFE (∆∆G) of the two inputs. We also utilize the latent vector of each input, before taking the difference, to determine the absolute binding affinity of each input using a fully connected layer.

We train our model using a linear combination of loss components (Figure 1):

L = α·L_∆∆G + β·L_∆G + γ·L_rotation + δ·L_consistency, with α, β, γ, δ ∈ R+.

During the training of our model we set α = 10 and β, γ, δ = 1. L_∆∆G is the mean square error (MSE) of the RBFE prediction. L_∆G is the MSE of the absolute binding affinity predictions for both inputs. L_rotation is the MSE between the latent vectors of two randomly rotated versions of each protein-ligand pair; this component encourages the latent space representation to ignore the rotation of the protein-ligand complex. L_consistency is the MSE between the difference of the predicted absolute binding affinities and the predicted RBFE, ensuring that the model provides consistent predictions.

The Default2018 architecture's weights are initialized with the Xavier uniform method 30 and the biases are initialized to zero. The Dense model is initialized with the weights and biases learned from its training described in Francoeur et al. 13 All models are trained using the Adam stochastic gradient descent optimizer 31 with the default parameters (β1 = 0.9, β2 = 0.999, ε = 1×10−8). Models are trained for 1000 epochs with a learning rate of 0.000367 and a scheduler that reduces the learning rate by a factor of 0.7 whenever the loss plateaus for more than 20 epochs. Data augmentation is achieved by randomly rotating and translating the inputs, with a maximum translation of 2 Å from the center of mass of the ligand. A minimal sketch of this architecture, loss, and training setup is given below.
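The following is a hedged PyTorch sketch of the Siamese architecture, multi-task loss, and training setup described above. The CNN trunk (Default2018 or Dense) is abstracted behind a placeholder module; its internals, the grid shapes, and all helper names are illustrative assumptions, not the exact Gnina implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRBFE(nn.Module):
    def __init__(self, trunk: nn.Module, latent_dim: int):
        super().__init__()
        self.trunk = trunk                        # shared-weight CNN arm
        self.abs_head = nn.Linear(latent_dim, 1)  # absolute affinity (∆G) head
        self.rel_head = nn.Linear(latent_dim, 1)  # linear map: latent diff -> ∆∆G

    def forward(self, grid_a, grid_b):
        z_a, z_b = self.trunk(grid_a), self.trunk(grid_b)
        dg_a, dg_b = self.abs_head(z_a), self.abs_head(z_b)
        ddg = self.rel_head(z_a - z_b)            # latent-space subtraction
        return z_a, z_b, dg_a, dg_b, ddg

def multitask_loss(model, grid_a, grid_b, grid_a_rot, grid_b_rot,
                   dg_a_true, dg_b_true,
                   alpha=10.0, beta=1.0, gamma=1.0, delta=1.0):
    z_a, z_b, dg_a, dg_b, ddg = model(grid_a, grid_b)
    # Latent vectors of randomly rotated copies of the same complexes.
    z_a_rot, z_b_rot, *_ = model(grid_a_rot, grid_b_rot)
    ddg_true = dg_a_true - dg_b_true              # assumed ordering convention
    l_ddg = F.mse_loss(ddg.squeeze(-1), ddg_true)                  # L_∆∆G
    l_dg = (F.mse_loss(dg_a.squeeze(-1), dg_a_true)
            + F.mse_loss(dg_b.squeeze(-1), dg_b_true))             # L_∆G
    l_rot = F.mse_loss(z_a, z_a_rot) + F.mse_loss(z_b, z_b_rot)    # L_rotation
    l_cons = F.mse_loss(ddg, dg_a - dg_b)                          # L_consistency
    return alpha * l_ddg + beta * l_dg + gamma * l_rot + delta * l_cons

# Training setup with the stated hyperparameters; the plateau patience
# interpretation is an assumption.
model = SiameseRBFE(trunk=nn.Sequential(nn.Flatten(), nn.LazyLinear(128)),
                    latent_dim=128)  # placeholder trunk for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=0.000367,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.7, patience=20)
# Each epoch: step the optimizer on multitask_loss over all batches, then call
# scheduler.step(epoch_loss) so the LR drops by 0.7x on a 20-epoch plateau.
```

In the actual models the trunk is the Default2018 or Dense CNN over libmolgrid voxel grids; the placeholder trunk here only keeps the sketch self-contained.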

Retrospective lead optimization evaluation
In order to directly compare our trained model to the model developed by Jiménez-Luna et al., 18 we utilize the additional-ligands training sets as described in their manuscript. We train our models on the reference ligand, as described in Data, and a given number of additional ligands. In the one-additional-ligand training set, we train on the two ordered pairs of the reference ligand and one additional ligand. Testing is then carried out on the two-permutations between the ligands in the training set and the ligands that the model has not seen; a sketch of this pairing scheme follows.
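An illustrative sketch of the pairing scheme under the stated setup: training uses both ordered pairs among the training ligands, and testing uses both orderings of each (training ligand, unseen ligand) pair. Ligand names are hypothetical placeholders.

```python
from itertools import permutations, product

train_ligands = ["reference", "extra_1"]   # reference + 1 additional ligand
unseen_ligands = ["lig_2", "lig_3"]        # ligands held out from training

# Both ordered pairs among the training ligands.
train_pairs = list(permutations(train_ligands, 2))
# [('reference', 'extra_1'), ('extra_1', 'reference')]

# Two-permutations between training ligands and unseen ligands.
test_pairs = [pair for a, b in product(train_ligands, unseen_ligands)
              for pair in ((a, b), (b, a))]
```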

External Datasets Evaluation
We evaluate the generalizability of our model when applied to unseen data by training on the filtered BindingDB dataset and evaluating on the external datasets, with increasing amounts of finetuning data drawn from each external set.

Ablation Study
We probe the performance of our model in relation to the components of the loss function as well as the architecture of our model. Using the one-additional-ligand training and testing sets, we investigate the average performance of 25 models as we disable aspects of the model.

Protein Family Generalization Evaluation
Lead optimization campaigns may involve proteins without close analogues in available training data. To probe this setting, we train our model on all of the BindingDB dataset excluding a single protein family and evaluate on the left-out family, repeating this for each protein family.


Results

Retrospective lead optimization evaluation

The models demonstrate a continual increase in performance as they are given more training information about the congeneric series. We find that the higher-parameter Dense model does better than the lower-parameter Default2018 model when fewer congeneric series comparisons are available. The difference between the performance of the two CNN architectures decreases as more information is added to the training set of the models.

External Datasets Evaluation
The RBFE prediction performance of the model varies widely across the different test sets when no finetuning is performed. However, with increasing amounts of finetuning, the correlation with experimental affinity increases and the error decreases (Figure 4), although some targets show almost no improvement when adding more finetuning information. CDK2 shows nearly perfect correlation and zero error with no finetuning performed, likely due to the high ligand similarity to the BindingDB dataset (Table S3).

Ablation Study
Removing L_∆∆G does not significantly decrease the performance of the RBFE predictions (Table 1), but increases the correlation and reduces the error of the absolute affinity prediction (Table S1). However, removing L_∆G or L_consistency drops the performance of the RBFE predictions by a considerable margin. The removal of L_rotation has little effect on the performance of the network, indicating that data augmentation may be all that is required to provide the necessary rotational invariance. When we remove both L_∆∆G and L_consistency, the Siamese network no longer provides predictions that are correlated with the experimental affinity values; however, the errors of the predictions are only slightly increased from the baseline.
Altering the Siamese network architecture does not affect performance as much as removing components of the loss function. If we exchange the latent space subtraction of the Siamese network for a concatenation, we do not see any change in the performance of the model for RBFE prediction. However, we find that when training the Siamese network with latent space subtraction on only one ordering of each ligand pair, the network is better able to capture the negation of the ∆∆G when the ligand ordering is reversed (Table S2). This demonstrates that the latent space subtraction better embeds the symmetry of the ∆∆G prediction problem in comparison to latent space concatenation. We also train a single-arm convolutional architecture to predict absolute affinity values using the same training set (No Siamese Network in Table 1). The single-arm architecture's absolute affinity predictions are subtracted for pairs of ligands to produce RBFE predictions, as sketched below. The single-arm convolutional architecture is worse than the CNN Siamese network at both absolute and relative binding affinity prediction.
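A minimal sketch of the single-arm baseline under stated assumptions: a standard CNN (here a hypothetical `single_arm_cnn` stand-in) predicts absolute affinities, and RBFE is formed by differencing them, so the errors of the two independent predictions compound.

```python
import torch

def rbfe_from_single_arm(single_arm_cnn: torch.nn.Module,
                         grid_a: torch.Tensor,
                         grid_b: torch.Tensor) -> torch.Tensor:
    dg_a = single_arm_cnn(grid_a)  # predicted absolute affinity, complex A
    dg_b = single_arm_cnn(grid_b)  # predicted absolute affinity, complex B
    return dg_a - dg_b             # ∆∆G estimate from independent predictions
```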

Generalization to new Protein Families
When the Default2018 Siamese network is trained on all of the BindingDB dataset, excluding the left-out protein family, and evaluated on the left-out protein family, we find that the average RBFE prediction correlation across all of the protein families is nearly zero (Figure 6). The variance across protein families is quite large, with some protein families having near perfect predictions and other protein families having extremely poor predictive performance.


Discussion

It is important to note that L_consistency encourages the model to make the ∆∆G prediction identical to the difference in absolute binding affinity predictions, which provides some notion of a ∆∆G loss even when L_∆∆G is removed. The removal of the L_rotation component did not significantly change the performance of the model, which may indicate some isotropic properties of the network architecture. The latent space structure imposed by the subtraction operation did not result in improved performance when using both ligand orderings for training. However, training on only one ordering of ligand pairs demonstrates that latent space subtraction is able to embed the RBFE ligand ordering symmetry. This symmetry embedding can be learned by a network using latent space concatenation when the freely available reversed orderings are used during training. The Siamese architecture enables understanding of ordering within congeneric series, which is ignored when training only for absolute affinity prediction. Without the Siamese architecture, the performance of the model suffers on both relative and absolute binding affinity prediction. This may be due to both the symmetry embedding of the network architecture and the increased regularization of the latent space that the Siamese architecture imposes.
The RBFE model has difficulty generalizing to new datasets. In most cases, we find that our model does not perform as well as the Siamese network proposed by Jiménez-Luna et al. 18 when evaluated on the external datasets from Mobley et al. 22 and Wang et al. 7
We find high protein and ligand similarity between the BindingDB congeneric series set and the external datasets from Mobley et al. 22 and Wang et al. 7 We provide the minimum protein distance, determined via a global alignment with no match parameters and no gap penalties (Biopython's pairwise2.align.globalxx), and the highest ligand similarity, determined with RDKit's FingerprintSimilarity function, in Table S3; a sketch of these metrics follows below. BACE has the minimum protein distance to a protein in the BindingDB dataset, and both CDK2 and PTP1B have greater than 95% of their ligands in the BindingDB dataset. This is likely why the correlation and error on CDK2 are nearly perfect with no finetuning. We would expect similar results for PTP1B, since it shares a nearly identical minimum protein distance and a similar percentage of ligands found in the BindingDB set, but PTP1B has lower correlation and higher error than CDK2 in the no-shot evaluation. Our models do not show the same level of RBFE prediction correlation as Jiménez-Luna et al., 18 however, the RMSE of the predictions is about the same or less. Correlation is a poor indicator of relative binding affinity performance, due to the low range of affinities in a congeneric series. 7,33 If, for instance, each ligand in a congeneric series has an identical affinity value, then there is no way to measure a predictive Pearson's R correlation. Therefore, it is difficult to determine whether the model built by Jiménez-Luna et al. 18 demonstrates more generalizability than our models.

When we evaluate our Siamese network on the more difficult Schindler et al. 2 dataset, we again see much variability in our model's performance across the different targets. Our model is unable to match the performance of the FEP+ method in correlation of prediction (Figures 5, S5, and S6) without using the largest amount of finetuning we explored. Examining the RMSE of the RBFE predictions shows the Siamese network outperforming the FEP+ method on all of the targets when all of the finetuning data is introduced. The lower correlation and lower error are due to our Siamese network predicting values around the mean of the data. FEP predictions do not depend on the labels of the data, since no training is necessary, and are therefore immune to this sort of mean-reverting prediction.
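A hedged sketch of the two similarity metrics reported in Table S3. The normalization of the alignment score into a distance and the SMILES-based input handling are assumptions on our part; the alignment and fingerprint calls are the ones named above.

```python
from Bio import pairwise2
from rdkit import Chem, DataStructs

def protein_distance(seq_a: str, seq_b: str) -> float:
    # Global alignment with no match parameters and no gap penalties (globalxx);
    # the conversion of the score into a normalized distance is an assumption.
    score = pairwise2.align.globalxx(seq_a, seq_b, score_only=True)
    return 1.0 - score / max(len(seq_a), len(seq_b))

def ligand_similarity(smiles_a: str, smiles_b: str) -> float:
    # Tanimoto similarity of RDKit topological fingerprints, as used both to
    # pick the reference ligand and to compare datasets.
    fp_a = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = Chem.RDKFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.FingerprintSimilarity(fp_a, fp_b)
```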
Despite good intra-congeneric series performance, our Siamese network does not generalize well to new protein families, suggesting the approach is best used later in the lead optimization process. We do show that adding information on the left-out protein family to the training set improves the performance of the RBFE predictions. However, noticeable improvements would require the experimental binding affinity determination of at least four ligands for the new protein family.
Work still needs to be done on both absolute and relative binding affinity predictors to ensure that they are learning robust models of the intermolecular interactions. Future work should focus on including additional symmetries involved in RBFE in the predictive models, such as cycle closure. 34,35 Including the L_consistency term, focused on the symmetry between the relative and absolute binding affinity predictions, was able to increase the performance of our RBFE predictions; therefore, including higher-order symmetries of the RBFE problem may enhance the performance of future models. Rotational symmetries of the inputs can be addressed with SE(3)-equivariant convolutions, 36 rather than a rotational loss. Adding uncertainty quantification [37][38][39] to the predictive model could enable large performance improvements with fewer ligands during finetuning, by focusing on the ligands with the greatest uncertainty according to the RBFE model.

Conclusion
Convolutional Siamese networks are capable of RBFE prediction (Figure 3). We find that higher capacity CNN models used in the arms of the Siamese network increase the predictive performance of the model. Our multi-task loss is able to boost the performance of the RBFE prediction in comparison to only calculating a loss on the RBFE (Table 1). This indicates that RBFE prediction is aided by increased regularization of the CNN latent space. The latent space subtraction of the Siamese network is able to implicitly embed the reverse symmetry of the RBFE prediction. However, the reverse symmetry is learnable without the latent space subtraction when the model is trained on both orderings of ligands for RBFE prediction. We note that our convolutional Siamese network's RBFE prediction correlation on external datasets is lower than previously reported, though the error of its predictions is comparable or lower.