1 Benchmarking computational methods to calculate the octanol / water partition coefficients of a diverse set of organic molecules

In the discovery process of new drugs and the development of novel therapies in medicine, computational modeling is a complementary tool for the design of new molecules by predicting for example their solubility in different solvents. Here, we benchmarked several computational methods to calculate the partition coefficients of a diverse set of 161 organic molecules with experimental logP values obtained from the literature. In general, density functional theory methods yielded the best correlations and lower average deviations. Although results are obtained faster with semiempirical and molecular mechanics methodologies, these methods yielded higher average deviations and lower correlation coefficients than hybrid density functional theory methods. We recommend the use of an empirical formula to correct the calculated values with each methodology tested.


Introduction
Over the past decades, an understanding of the lipophilicity of organic molecules has been essential in designing and synthesizing water-soluble drugs. Moreover, once the solvent solubility is determined, a correlation of the pharmacokinetics with the molecular structure needs to be established for the new drug submitted for clinical trials. 1 Besides the development of profiling techniques based on the Lipinski Rule 2 used in drug discovery 3 , determination of the partition coefficient (P), reported as the base 10 logarithm of P (logP, see equation 1), has been a convenient strategy to provide information about a molecule's lipophilicity. The partition coefficient (P) of a substance is defined as the ratio of its concentration in organic and aqueous phases when the system is in equilibrium. 4 15 to the test set used by Nedyalkova et al. 13 We proceeded to performed calculations with DFT, semiempirical and molecular mechanics (MM) computational methods to benchmark these methodologies. We found the best correlation to be obtained with wB97XD/6-311+G(2d,p).

I. Test set selection
A total of 161 molecules with a variety of functional groups and experimental n-  (Table S1) with each computational methodology discussed below. A summary of the results can be found in Table 1. Scatter plots of correlations 1, 6, 7 and 11 are shown in

II. LogP determination with DFT and semiempirical methods
All geometry optimizations and frequency calculations were performed with Gaussian 09 18 using the solvent based on density (SMD) implicit solvent model for water or noctanol. SMD has been reported to yield the best results in comparison to other continuum based solvent models. 14 The DFT hybrid methods benchmarked with this test set were B3LYP, M062X, and wB97xD functionals with either the 6-31G(d) or 6-311+G(2d,p) as basis sets. The semiempirical methods benchmarked were AM1, PM3, and PM6.
To calculate the logP values for each molecule, we used the Gibbs free energy of the molecules optimized in water and subtracted it from the Gibbs free energy of the molecules optimized in n-octanol (∆Go/w). Then, logP was calculated according to:

III. LogP determination with MM methods
Molecular mechanics (MM) methods were carried out using Schrodinger's Maestro 12.5 software and MacroModel tools. The force fields benchmarked were OPLS3e, OPLS, and OPLS2005. The 161 molecules were re-optimized with each of the tested force fields in the gas phase. The OPLS potential parameters used were the dielectric constant as the electric treatment, and charges from the force field. Extended non-bond cutoffs of 8.0, 20.0, and 4.0 Å were used as the parameters for van der Waals, electrostatics, and H-bond, respectively. 19 After re-optimization, the potential parameters were adjusted using noctanol as the primary solvent, whereas the comparison parameters were activated using the "logP estimation" tool. Water was set as the secondary solvent, to determine the partition coefficients.  As shown in Table 1 Moreover, since each force field has a particular training set, the nature of the training set must be considered. For example, OPLS and OPLS3e are force fields preferred for biopolymers and carbohydrates, whereas OPLS2005 is the preferred force field for biological systems and organic molecules. 24 This could explain the higher correlation coefficient obtained with OPLS2005. optimized using B3LYP/6-31G(d). This is an example of a case in which higher energy structures (left) yield better results than lower energy structures (right).

Results and Discussion
In the context of conformational structures, using the lowest energy structures was assumed to be crucial to obtain the best computational results as it has been reported by Ho et al. 14 Nevertheless, we found a couple of cases where using higher energy structures yielded better agreement between the calculated and experimental values. One example is shown in Figure 3, where using the higher energy structures (left) resulted in a lower deviation from the experimental logP than using the lower energy structures (right). These results suggest the possibility of improving the correlations by using a Boltzmann population distribution weighted approach 25 to calculate logP values.

Conclusions:
Computational chemistry is a key tool to study the solubility of molecules in complex systems. Therefore, it is important to benchmark the accuracy of computational methods to enable direct comparison to experimental results, and prediction of the lipophilicity and other properties of new compounds.
We tested the accuracy of three DFT functionals, three semiempirical methods, and three molecular mechanics models. Among the methods benchmarked, the most efficient was wB97xD/6-311+G(2d,p). Semiempirical and molecular mechanic methods yielded reasonable correlations, but hybrid DFT methods are recommended for more precise results. To correct systematic errors, we suggest using the correction logPcorrected = mlogPcalculated + b (where m and b are the slope and intercept found in Table 1, respectively) for calculations done with each methodology presented in Table 1.
These results provide the benchmarking of a diversity of molecules and methodologies that will be useful for future studies in the design and characterization of complex systems such as large molecules, surfactants, and drugs.

Declaration of Interest
The authors declare no competing interests.