## Abstract

Predicting the binding free energy of ligand-protein complexes has been a grand challenge in computational chemistry since the early days of molecular modeling. Several computational methodologies exist for predicting ligand binding affinities. The pathway-based Free Energy Perturbation (FEP) and Thermodynamic Integration (TI) methods, as well as Linear Interaction Energy (LIE) and Molecular Mechanics-Poisson Boltzmann/Generalized Born Surface Area (MM-PBSA/GBSA), have been applied to a variety of biologically relevant problems with differing levels of predictive accuracy. Recent advances in computer hardware and in simulation algorithms for molecular dynamics and Monte Carlo sampling, together with improved general force field parameters, have made FEP a principal approach for calculating free energy differences, especially differences in host-guest binding affinity upon chemical modification.

Because the FEP-calculated binding free energy difference, denoted ddGFEP, characterizes only the difference in free energy between pairs of ligands or complexes, and not the absolute binding free energy of each individual host-guest system, denoted dG, we examine two rarely asked questions in FEP applications:

1) Which values would be more appropriate as the prediction to assess the ligands prospectively: the calculated pairwise free energy differences, ddGFEP, or the estimated absolute binding energies, d^G, transformed from ddGFEP?

2) In the situation where only a limited number of ligand pairs can be calculated in FEP, can the perturbation pairs be optimally selected with respect to the reference ligand(s) to maximize the prediction precision?

These two questions concern the validity of an often-neglected assumption in pairwise comparisons: that the pairwise value is sufficient to make a quantitative and reliable characterization of an individual ligand's properties or activities. This implicit assumption would hold if there were no error in each pairwise calculation. Recent pair designs, such as multiple pathways or cycle closure analyses, provide estimates of the calculation error but do not address the statistical impact of the two questions above. The impact of error is minimized by an exhaustive study that computes all NC2 = N(N-1)/2 pairs for a set of N molecules, or more if there is directionality (dGi,j != dGj,i). Such a study design is impractical and unnecessary. We therefore seek to collect an amount of data that is 1) feasibly attainable, 2) topologically sufficient, and 3) mathematically synthesizable, so that we can mitigate inherent calculation errors and have higher confidence in our conclusions.

The significance of the above questions can be illustrated by the motivating example shown in Figure 1 and Table 1, which considers two different perturbation graph designs for 20 ligands with the same number of FEP perturbation pairs, 19, and the same reference, Ligand 1. The two designs reached different conclusions in rank ordering ligand potencies because of errors inherent in the FEP-derived estimates. Based on design A, ligands 5, 7, 14, and 15 would be selected as the best four (20%) picks, since their d^G estimates are the most favorable. Design B would yield ligands 5, 12, 18, and 19 as the best for the same reason. Without knowing the true values, dGTrue, of the other 19 ligands, we lack a prospective metric to assess which design is more precise, even though, retrospectively, we know that both designs agreed reasonably well with the true values as measured through correlation and error metrics. However, the top picks from neither design were consistent with the true top four ligands, which are ligands 7, 10, 12, and 18. If all 20C2 = 190 pairs had been calculated, as listed in the last column of Table 1, the best four ligands would have been correctly identified. The other metrics included in Table 1 were also significantly improved. However, as mentioned above, calculating all possible pairs, or even a significant fraction of them, is unlikely in practice, especially when the number of molecules is large. Given this restriction, is it possible to determine objectively whether design A or design B will give more precise predictions?

In this report, we first investigated the performance of the calculated ddGFEP values compared to the pairwise differences in least-squares-derived d^G estimates, both analytically and through simulations. Based on our findings, we recommend applying weighted least squares to transform ddGFEP values into d^G estimates. Second, we investigated the factors that contribute to the precision of the d^G estimates, such as the total number of computed pairs, the selection of computed pairs, and the uncertainty in the computed ddGFEP values. The mean squared error, denoted MSE, and Spearman's rank correlation are used as performance metrics.
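The weighted least squares transformation described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `wls_dg_estimates`, its parameters, the sign convention (ddG for pair (i, j) approximates dG[j] - dG[i]), and the assumptions that the per-pair standard deviations are known and the errors independent are all hypothetical choices made for the example.

```python
import numpy as np


def wls_dg_estimates(n_ligands, pairs, ddg, sigma, ref_index=0, ref_dg=0.0):
    """Estimate per-ligand dG from pairwise ddG values by weighted least squares.

    pairs : sequence of (i, j); ddg[k] is assumed to approximate dG[j] - dG[i]
    sigma : assumed-known standard deviation of each ddg value
    The perturbation graph must connect every ligand to the reference.
    """
    pairs = np.asarray(pairs)
    ddg = np.asarray(ddg, dtype=float)
    # Scaling each row by 1/sigma is equivalent to weighting by 1/sigma^2.
    w = 1.0 / np.asarray(sigma, dtype=float)

    # Signed incidence matrix: -1 at the start ligand, +1 at the end ligand.
    A = np.zeros((len(pairs), n_ligands))
    rows = np.arange(len(pairs))
    A[rows, pairs[:, 0]] = -1.0
    A[rows, pairs[:, 1]] = 1.0

    # The reference dG is treated as known, so move its column to the right-hand side.
    free = [k for k in range(n_ligands) if k != ref_index]
    b = ddg - A[:, ref_index] * ref_dg
    x, *_ = np.linalg.lstsq(A[:, free] * w[:, None], b * w, rcond=None)

    dg = np.empty(n_ligands)
    dg[ref_index] = ref_dg
    dg[free] = x
    return dg


# Toy three-ligand cycle whose ddG values do not close exactly (sum is 0.3).
toy_pairs = [(0, 1), (1, 2), (2, 0)]
toy_ddg = [1.0, 1.0, -1.7]
print(wls_dg_estimates(3, toy_pairs, toy_ddg, sigma=[1.0, 1.0, 1.0]))
```

With equal weights, the cycle closure error in the toy example is spread evenly over the three edges, so the estimates for ligands 1 and 2 come out near 0.9 and 1.8 rather than the raw 1.0 and 2.0. Giving one edge a larger `sigma` shifts more of the correction onto that edge.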

To illustrate, we demonstrated how structural similarity can be included in the design and its potential impact on prediction precision. As in the majority of reported FEP studies on binding affinity prediction, the ddGFEP pairs were selected based on chemical structure similarity. Pairs with small chemical differences are assumed to be more likely to have smaller errors in the ddGFEP calculation. Using the constructed mathematical framework together with literature examples, we demonstrate that some pair-selection schemes (designs) are better than others. To minimize prediction uncertainty, we recommend choosing the design optimality criterion to suit the practical application.
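To make the idea of comparing designs before any FEP calculation concrete, the sketch below scores three perturbation graphs for 20 ligands by the trace of the covariance matrix of the least squares dG estimates. The trace (often called the A-criterion in experimental design) is used here only as one plausible example of a design optimality criterion; the text above does not state which criteria were studied. The helper names `dg_covariance` and `a_criterion`, the star and chain graphs, and the assumption of independent errors with unit variance on every pair are all hypothetical.

```python
import numpy as np


def dg_covariance(n_ligands, pairs, sigma=None, ref_index=0):
    """Covariance of weighted least squares dG estimates for a given design.

    Assumes independent pair errors with standard deviations `sigma`
    (unit variance if omitted) and a reference ligand with known dG.
    The perturbation graph must be connected, otherwise the matrix is singular.
    """
    pairs = np.asarray(pairs)
    m = len(pairs)
    sigma = np.ones(m) if sigma is None else np.asarray(sigma, dtype=float)

    # Signed incidence matrix: -1 at the start ligand, +1 at the end ligand.
    A = np.zeros((m, n_ligands))
    rows = np.arange(m)
    A[rows, pairs[:, 0]] = -1.0
    A[rows, pairs[:, 1]] = 1.0

    # Drop the reference column and scale rows by 1/sigma (weights 1/sigma^2).
    free = [k for k in range(n_ligands) if k != ref_index]
    Af = A[:, free] / sigma[:, None]
    return np.linalg.inv(Af.T @ Af)


def a_criterion(cov):
    """Sum of the estimate variances; smaller means a more precise design."""
    return float(np.trace(cov))


n = 20
star = [(0, j) for j in range(1, n)]                         # every ligand paired with the reference
chain = [(j, j + 1) for j in range(n - 1)]                   # ligands linked one after another
full = [(i, j) for i in range(n) for j in range(i + 1, n)]   # all 190 pairs

for name, design in [("star", star), ("chain", chain), ("full", full)]:
    print(name, len(design), a_criterion(dg_covariance(n, design)))
```

Under these equal-variance assumptions, the star and chain designs both use 19 pairs, yet the summed variance is 19 for the star and 190 for the chain, while the exhaustive 190-pair design gives 1.9. In practice the per-pair variances differ, for example when structurally dissimilar pairs carry larger errors, so passing realistic `sigma` values can change which design scores best.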

## Supplementary material

SupportingInfo