Improving force ﬁeld accuracy by training against condensed phase mixture properties

Developing a suﬃciently accurate classical force ﬁeld representation of molecules is key to realizing the full potential of molecular simulation as a route to gaining fundamental insight into a broad spectrum of chemical and biological phenomena. This is only possible, however, if the many complex interactions between molecules of diﬀerent species in the system are accurately captured by the model. Historically, the intermolecular van der Waals (vdW) interactions have primarily been trained against densities and enthalpies of vaporization of pure (single-component) systems, with occasional usage of hydration free energies. In this study, we demonstrate how including physical property data of binary mixtures can better inform these parameters, encoding more information about the underlying physics of the system in complex chemical mixtures. To demonstrate this, we re-train a select number of the Lennard-Jones parameters describing the vdW interactions of the OpenFF 1.0.0 (Parsley) ﬁxed charge force ﬁeld against training sets composed of densities and enthalpies of mixing for binary liquid mixtures as well as densities and enthalpies of vaporization of pure liquid systems, and assess the performance of each of these combinations. We show that retraining against the mixture data improves the force ﬁeld’s to reproduce both pure and mixture properties, reducing some systematic errors that when training vdW interactions against properties of pure systems


Introduction
Atomistic molecular simulations are a popular and effective method for examining biomolecular systems in silico, revealing molecular insights in protein folding, protein-ligand binding, membrane transport, and many other phenomena. For many of these use cases, quantitative accuracy is required for meaningful predictions. One critical example is binding free energy calculations for protein-ligand compounds. These calculations are an important step in the computational drug discovery process, but are only useful to medicinal chemists if predictions are sufficiently accurate and rapid [1]. Consequently, there has been much interest in producing improved parameter sets for the simple fixed charge functional forms common to most modern force fields. One key type of parameters are the parameters that specify the Lennard-Jones (LJ) interaction terms, which are used in standard organic and biomolecular force fields to capture the short-range attractive and repulsive non-bonded interactions that drive many important biomolecular processes.
The simplest method for obtaining LJ parameters is estimation from experimental correlations [2], as in the original CHARMM [3] and GROMOS [4] force fields. This method has a low computational overhead but very limited accuracy. Training LJ parameters against experimental properties is more computationally expensive, but became the predominant method in small molecule force fields, facilitated by the increase in computational power required to simulate those properties. This method has been used by many force fields, including OPLS [5], CGenFF [6], GAFF [7], and GROMOS [8]. The dominant parameterization paradigm is to train the LJ parameters against liquid density ( ) and heat of vaporization (Δ ) measurements, as in the original OPLS parameterization by Jorgensen et al. [9]. These two physical property targets are used because they are simple to calculate from simulation [10], are dependent on the molecular volume and attractive forces, and together constrain the LJ and parameters better than they do individually. We note that while this is the dominant choice, alternatives exist; notably, the GROMOS 53A5/53A6 force fields use enthalpies of hydration and solvation in addition to and Δ [8]. Additionally, ab initio calculations can be used to inform parameterization, for example, using rare-gas interaction energies and geometries to produce initial parameter estimates subsequently refined with physical property data [11,12]. More recently, methods to produce LJ parameters entirely from ab initio data, using atom-in-molecule electron density partitioning [13,14], or the exchange hole dipole model [15] have been proposed. Still, parameterization against small molecule and Δ is the dominant paradigm [16,17]. Training against Δ in particular has some problematic aspects. Using fixed charge force fields, predictions of Δ require performing simulations in both liquid phase and gas phase, which means that the same parameters must capture two different polarization states [18,19] to reproduce experimental measurements of Δ . There has been significant discussion on how to account for this polarization cost, which also arises in the calculation of hydration and solvation free energies [19][20][21]. Methods suggested include calculating an explicit polarization cost [19] or using semi-polarized charges [13,22], but the issue has not been definitively resolved. Additionally, some compounds, such as acids, can form clusters in the gas phase [23,24], which are not generally represented in gas phase simulations used to predict Δ . Another major issue is the availability of modern experimental Δ data. The NIST ThermoML Archive [25] is the one of the largest open databases for physical property measurements, and contains roughly 500 total Δ data points, where a "data point" in this context is defined as an experimental measurement for a specific compound at a given temperature , pressure , and mole fraction . In contrast, the ThermoML Archive contains over 60,000 measurements of pure densities. The ThermoML Archive is certainly not the only location of Δ data (it lacks data prior to the year 2003, and many measurements of Δ date to the mid-20th century), but it is challenging to obtain uncertainty estimates [26], rigorous provenance [27], or fully computer-readable forms for these older measurements. This makes it difficult to systematically vet the experimental procedures and outputs when curating large scale datasets for parameter optimizations.
For a fixed charge small molecule force field geared towards biomolecular systems in heterogeneous condensed phase, properties of mixtures such as the densities ( ( )) and enthalpies of mixing (Δ ( )) of binary mixtures are an attractive alternative to the properties of pure systems for several reasons: 1. Properties of mixtures, especially in the cases of mixtures that deviate strongly from ideality, can be sensitive to interactions between functional groups that are not generally present in the pure substances used to train LJ parameters [28,29]. This is especially important for capturing solute-solvent interactions. 2. The nature of mixture data allows users to more easily include a diverse spectrum of interactions in their training sets. For example, mixtures of drug-like molecules with pharmaceutically relevant solvents or amino acid analogues can in principle be readily included in training sets to allow the LJ parameters of solvents, ligands, and bio-polymers to be self-consistently trained. 3. Although computing some properties of mixtures may require multiple simulations, most such properties (including those studied here) do not require simulations in different phases, minimizing error caused by polarization differences. There may be some difference in polarization of molecules between more polar and less polar liquids, but this difference is significantly less than the difference between two phases, especially since liquid mixtures are, by definition, miscible and the components must therefore have dielectric constants that are not completely dissimilar. 4. Including mixture data adds the ability to vary training set data by composition; data points can be selected at ( , , ) rather than just ( , ), probing the balance between pure and mixture interactions. 5. Many data points for mixture properties are available in modern sources such as the ThermoML Archive. In particular, binary Δ ( ) measurements are much more abundant in the ThermoML archive compared to pure Δ . For the moieties and conditions of interest in our study, there are 382 binary mixtures with Δ ( ) measurements (generally available at multiple concentrations), compared to 24 single-component Δ measurements that fit the same criteria. For density measurements, both mixture and pure component data points are relatively abundant, with 4000 data points for pure substances and 900 binary mixtures matching our criteria.
In this study, we aim to rigorously assess whether it is more beneficial to train the intermolecular LJ parameters of a force field on solely pure substance data, binary mixture data, or a combination of both, with an emphasis here on density-related properties ( , ( )) and enthalpic properties (Δ , Δ ( )). A combination of density and enthalpic data should be generally sufficient to constrain the LJ and parameters, with densities providing the most information about and enthalpic properties providing information on via the cohesive forces between molecules, though there is of course some partial cross-correlation between parameters [30].
Starting with the OpenFF 1.0.0 (Parsley) force field [31], we use this data to train 12 Lennard-Jones parameters ( and for 6 LJ types) against data for alcohols, esters, ethers, ketones, acids, and alkanes, with property measurements chosen from four training sets containing different combinations of physical properties. To test the performance of the refitted force fields, we benchmark the results of this optimization against a larger test set of physical property measurements for the same moieties, consisting of ( ), Δ ( ), , and Δ measurements.

Optimization strategy
The studies proposed are constructed with the following workflow, as shown in Figure 1.
1. Sourcing a training set of molecules and selecting particular measurements for each molecule (or pair of molecules) of interest. 2. Optimizing only the selected LJ parameters against the training set using ForceBalance [32] in combination with the OpenFF Evaluator framework [33], starting from the OpenFF 1.0.0 (Parsley) [34] force field parameters. 3. Assessing the performance of the trained force field against a test set of measurements using the OpenFF Evaluator framework.
The goal of the study was to assess whether training the LJ parameters against properties of mixtures, as well as combinations of pure/mixture properties, is more beneficial than training against properties of pure systems. Other force field parameters, namely the valence and electrostatic parameters, were not optimized.

Organic Mixture Studies
We selected four combinations of physical property data types (densities of pure compounds and binary mixtures, heats of vaporization of pure compounds, and enthalpies of mixing of binary mixtures) to optimize against (shown in Table 1). A training dataset consisting of physical property measurements for organic molecules is selected from the NIST ThermoML database. Starting with the OpenFF 1.0.0 (Parsley) force field, the physical properties in the training dataset are estimated using the force field and the OpenFF Evaluator software package.LJ parameters are then adjusted by minimizing the difference between the simulation results and experimental training data via a regularized least-squares procedure as implemented in the ForceBalance package [32] Properties Included , binary mixture densities ( ( )), and binary enthalpies of mixing (Δ ( )). These measurements cover a set of alcohols, esters, ethers, ketones, acids and alkanes, which is further described in Figures 2 and 3. The 4 training sets in this study are labeled based on which of these measurements are included, and are described in Section 2.1.1.

( , Δ
) ("pure only"): Includes only density , and enthalpy of vaporization Δ , data points. This is the type of training set which has most commonly been used [5][6][7] for training the non-bonded interaction force field parameters, and is therefore included as a historical baseline. 2. (Δ ( ), ( )) ("mixture only"): Includes only density ( ) and enthalpy of mixing Δ ( ) data points measured for binary mixtures. This data set allows us to explore whether mixture data alone is sufficient to constrain the non-bonded force field parameters during training, and if force field trained without pure compound data points will be able to accurately reproduce pure compound data. 3. (Δ ( ), ( ), ) ("mixtures + pure density"): A combination of ( ), Δ ( ), and data points. This extension of the "mixture only" training set is included to explore whether including the density of pure systems helps to constrain the optimization, or whether ( ) alone is sufficient. 4. (Δ ( ), ( ), , Δ ) ("pure and mixture"): A combination of the "pure only" and the "mixture only" training sets. This data set tests whether including pure Δ alongside Δ ( ) improves the parameterization of the cohesive energies between molecules, or whether Δ ( ) alone is sufficient.
The measurements in the training set are for molecules composed of carbon, hydrogen and oxygen only (including alcohols, esters, ethers, ketones, acids and alkanes). These compounds cover a wide range of fluid phase polarizabilities, with relative permittivities ranging from 1.9 (hexane [35]) to 35.7 (methanol [36]).

Data set selection
All training sets considered here are composed of only alcohols, esters, ethers, ketones, acids and alkanes that have ample density and enthalpic data available, and contain only data points measured at nearambient conditions (288.15-323.15K, 0.95-1.05 atm). This set of moieties, containing only carbons, hydrogens and oxygens, was chosen to limit the scope of the study and focus specifically on the choice of training data for a set of molecules. The molecules included exercise a total of 9 LJ types, of which 6 are optimized, shown in Table 2. The three types included that are not optimized are all hydrogen parameters; an explanation of why they are not optimized is given in Section 2.4.1.
We enforce the criteria that all measurements in the data set contain only the molecules in Figure 2. This criteria controls for the identity of molecules used in the optimization; whether the measurements used in fitting are from pure substance or binary mixtures, they are restricted to the same set of molecules. We note that some values for ( ) are obtained through the conversion of ( ) and where ( ) is not directly available.

Pure substance training set
The "pure only" training set is composed of one and one Δ measurement for each of the selected molecules ( Figure 2). These molecules were manually chosen to include a selection of esters, ethers, ketones, alcohols and alkanes which included long/short chain, branched/unbranched, and cyclic/acyclic characteristics, where data was available. The measurements were sourced from the NIST ThermoML [25] archive. The Δ measurements were sourced directly from the literature, as very limited data for the moieties of interest is available in the ThermoML Archive. Many data points were curated from the Majer et al. review [26], where care was taken to select data points which were deemed as reliable by the authors, and for which at least three independent measurements had been made and were in reasonable agreement. In total, 28 molecules were chosen for a total of 56 data points (28 data points  and 28 Δ data points [24,[62][63][64][65][66][67][68][69][70][71][72]). For Δ of acids, measurements were sourced which correspond to an infinitely dilute gas (as computed in [24]), which corresponds to the gas we simulate. This is done because carboxylic acids tend to associate in the gas phase.

Mixture training set
The binary mixtures selected for the mixture training set ( Figure 3) are composed of the molecules included in the pure training set, and were manually chosen to include a diverse set of interactions. These property measurements were sourced directly from the NIST ThermoML [25] archive using the OpenFF Evaluator's built-in data selection tools. Where available, three ( ) and three Δ ( ) data points were included for each binary mixture, one each at 25%, 50%, and 75% composition, or as close to these values as possible given data availability. These compositions were chosen so as to ensure that the set included both compo-nents in excess to the other as well as in close to equal amounts. Compositions between 25-75% should capture most of the relevant information, as deviations from ideality for many mixtures are maximized near an equal mixture. Mixtures with compositions close to pure (e.g. > 0.9) were excluded, as when the concentration of one component becomes small, our simulation boxes (1000 total molecules) would have a very low number of molecules of that component. In total, measurements made for 33 binary mixtures were selected for a total of 195 data points. This is significantly more than the 56 total data points in the pure data set, but it is drawn from a number of mixtures similar to the number of compounds in the pure training set. We note that we discovered that one Δ ( ) data point (described in supporting information, Section 4) used in the mixture training set that was transcribed into ThermoML incorrectly after training was complete. The mixture data used in our training sets contains one ( ) and one Δ ( ) measurement per mixture for three different compositions if multiple compositions were available (close to 25%, 50% and 75%) measured at close to ambient conditions (P=1 atm, T=298 K), yielding a training set of 33 binary mixtures with 187 data points total.

Test set
The test set was chosen to include measurements of ( ), Δ ( ), , and Δ data points as in the training set. Unlike the training set, we do not require that all pure substance and binary mixture measurements in the test set must be sourced from the same set of molecules. Instead, given the limited amount of diverse Δ ( ) and Δ data for the selected moieties, focus was given to selecting as diverse a test set as possible which maximally exercised the re-trained parameters. Data points from pure substances included in the training set were excluded from the test set, as well as mixture data points from mixtures included in the test set. The test set did include binary mixtures for which one of the two components was present in the training set; for example, a mixture of ethanol and pentanol would be permissible in the test set even if data points for ethanol/propanol and butanol/pentanol were both included in the training set. This expands the test set to types of mixtures that were not included in the training set; for example, mixtures containing either an alcohol or ketone are in the training set, but alcohol/ketone mixtures are only included in the test set. The set was also selected to contain substances as distinct as possible from the training set, and from other molecules in the test set. Mixtures including carboxylic acids were not included in the test set due to low data availability.
In order to select a maximally diverse test set from the pool of molecules available in the ThermoML Archive, a distance metric based on molecular fingerprints was defined to determine how distinct any two substances are. Then, binary mixtures were selected by a greedy optimization that maximized this distance metric. For a more detailed description of this process, see the Supporting Information Section 1.
The substances included for pure substance ( and Δ ) measurements were then chosen to match the components of the test set mixture properties where available; these were supplemented with measurements for similar molecules that exercise the same LJ parameters. This resulted in a test set consisting of 236 Δ ( ) and 385 ( ) data points, which was supplemented with a hand-selected test set of 29 Δ and 29 pure component measurements.

Physical property simulations
All estimates of the physical property values were performed using the OpenFF Evaluator [33] package version 0.1.0 [73] using the default estimation workflow schemas, which are outlined in detail in the OpenFF Evaluator documentation [74]. Where possible, simulations are reused to calculate physical properties. For example, simulations of a pure liquid phase can be reused in calculations of , Δ and Δ .

Pure Liquid Simulations
Pure liquid properties were calculated by simulation in the NPT ensemble, at the temperature and pressure from the corresponding physical property reference. These were performed with the default OpenFF Evaluator simulation workflow, in which a box of 1000 molecules of the target substance were placed in a simulation box using PackMol [75], with parameters then assigned using the OpenFF Toolkit version 0.6.0 [76]. An energy minimization and 0.2 ns equilibration run were then performed using OpenMM. Subsequently, the molecules were simulated for 2 ns. For all simulations, a Langevin integrator with BAOAB [77] splitting and a 2 fs timestep, and the default OpenMM Monte Carlo barostat, were employed to ensure simulation in the correct NPT ensemble. Uncorrelated and well-equilibrated snapshots were used to compute the ensemble averages of any observables, according to the procedure outlined by Chodera [78]. All uncertainties in the average observables were computed by bootstrapping with replacement, and propagated through any further calculations, assuming a Gaussian error model. Densities are estimated using ensemble averages from these simulations.
Locations of scripts to run the simulations and reproduce the results in this study are available in the Code and Data Availability section.

Enthalpy of Vaporization Calculations
Enthalpies of vaporization require a pure liquid simulation, as described in Section 2.3.1, as well as a gas phase simulation. This gas phase simulation is performed for a single molecule in the NVT ensemble, with periodic boundaries disabled, using the same Langevin integrator as used with the liquid simulations. This simulation is run for 30 ns instead of the liquid phase 2 ns to converge statistics with only a single molecule. Enthalpies are calculated using Equation 1.

Mixture Properties
Mixture densities were simulated with a similar workflow to the pure liquid simulations, but with the molecules in the initial box split proportionally between the two chemical species according to the experimental mole fraction. Densities of binary mixtures are straightforward to calculate as they do not require more than one simulation; the process is the same as for densities of single component liquids. Binary enthalpies of mixing are calculated according to equation 2, where the enthalpies of the individual simulated components are multiplied by their mole fractions in the mixture, and then subtracted from the simulated mixture Δ , 1 , 2 .
Enthalpies used in this calculated were simulated with a set of 3 simulations: one for each pure component, and one for the mixture. Each of these simulations followed the standard workflow for a pure or mixture property. 7 of 21

Optimization
For stochastic gradient descent optimizations, we need to estimate gradients of the observables of interest as a function of force field parameters. In this paper, gradients are calculated using a reweighted finite difference scheme, where the derivative d ∕d of an observable with respect to a parameter is calculated using the central difference method with a relative step size of ℎ = 10 −4 . Values of at − ℎ and + ℎ are estimated using MBAR reweighting [79], which is accurate for the properties of interest over the step size ℎ. All optimizations were performed using the ForceBalance software package using the built-in OpenFF Evaluator target [32,33]. Optimizations were run using the Levenberg-Marquardt [80] non-linear least squares algorithm with adaptive trust radius [81,82] to iteratively minimize the objective function until it was observed to fluctuate around a minimum value in each optimization. This algorithm has been used successfully with ForceBalance for force field optimization previously [32,83]. In all cases 12 iterations was sufficient to meet this criteria. Each iteration consists of 1) estimation of each physical property measurement in the training set using the current force field parameters, 2) comparison of those estimated values to the experimental values in ThermoML, 3) adjustment of the target parameters with the ForceBalance optimizer. A weighted least squares objective function, , was used to measure deviations of the reference and estimated physical property values. An L2 penalty function based on the norm of the parameter displacement vector (from the initial parameters) is used to regularize the optimization, with a prior over the ForceBalance mathematical parameters [32] of 0.1 for and 1.0 for .
where is the number of types of properties (e.g. density, enthalpy of vaporization, etc.), is the number of data points of type , is the experimental value of data point and ( ) is the estimated value of data point using the current force field parameters. The denominator is an inverse weight with the same units as property type chosen so that that each property type contributed approximately equally to the objective function. For example, for the pure training set, ∼ 50% of the objective function value is due to data, and ∼ 50% is due to Δ . This a priori approximation was made as it is unclear that any one type of property should be weighted more than another.

Parameters optimized
Both the training and test sets, each containing only molecules composed of carbon, hydrogen, and oxygen, exercise a total of 18 SMIRNOFF LJ parameters (9 different SMIRKS parameter types with one and per SMIRKS). Of these parameters, 12 were optimized, with the remaining 6 held constant at their initial OpenFF 1.0.0 values. The parameters held constant (all for hydrogens) were not optimized because either the parameter correspond to a specific context that was not sufficiently constrained by the training data set or, in the case of [#1:1]-[#8] (hydroxyl hydrogen), the OpenFF 1.0.0 value is explicitly set to a very small nonzero value ( = 5.27 × 10 −5 ) and not reoptimized. This is a slight modification of the AMBER hydroxyl hydrogen parameter [84] (HO, = 0) to avoid unphysical effects caused by the AMBER parameterization [85]. Here each parameter is uniquely identified by a SMIRKS pattern which encodes the chemical environment to which the parameter will be applied [86]. These parameters, along with brief descriptions, are listed in Table 2.

SMIRKS Pattern Description
Illustration Atoms with Optimized Parameters Hydrogen attached to generic oxygen Table 2. All atoms with LJ parameter types exercised by the training and test sets, categorized by whether they are re-optimized in this study. SMIRKS atom types are applied hierarchically, with more specific types superseding less specific types, as described in Mobley et al. [86]. Each of these atom types has a and parameter that describe the Lennard-Jones interactions; with 6 SMIRKS types included in the optimization, 12 Lennard-Jones parameters were optimized. In the "illustration" figures, any atomic index including a '*' is a wildcard, representing any atom, or group of atoms.

Testing
Tests of force field performance were performed by taking the final force fields produced from each optimization and estimating each data point in the test set using OpenFF Evaluator. All property calculations were made using the same property prediction workflows from section 2.3. Estimated properties for each data point were then compared against the experimental values, with RMSE and average Kendall rank correlation [87] values calculated against the test set for each force field.

Parameter Changes
The objective function was observed to decrease by 50-70% for each of the four optimizations performed, indicating improvements against the training set in all cases (see Supporting Information Section 3.1). This improvement was achieved with relatively small changes in the target parameters, as most of the refitted parameters changed only slightly from their initial values, varying less than 5% in most cases (Figure 4). A notable exception is for [#1:1]-[#6X4] (hydrogen attached to tetravalent carbon), which changes up to 40% depending on the optimization. We also note that the for [#8X2H1+0:1] (hydroxyl oxygen) changes much more when trained against mixture data (-0.4 % for "pure only" vs. -1.7-2.8% for sets containing mixture data).

Figure 4. The 4 different training sets generally drive the parameters in the same direction and to similar magnitudes, indicating all data sets encode somewhat similar parameter information. Changes in parameter values
for each of the training sets considered in this paper are shown as bar graphs above. The percent change in the each parameter for each of the training sets relative to their starting value taken from the OpenFF 1.0.0 force field. One notable difference between the "pure only" set and the sets containing mixtures is the [#8X2H1+0:1] (hydroxyl oxygen) parameter, which is reduced by only 0.4% in the "pure only" (orange) set, but reduced by 1.7-2.8% in the other sets.

Training Set Property RMSE
We examine the performance of the trained force fields on the training set, as well as the changes in parameters after optimizations. This detailed look at the optimization sheds light on which parameter changes are driving the specific property improvements that result in an improved force field. Using the RMSE for each target property as a metric and grouping by property and chemical environment, it is clear that most of the different moieties in the training set are improving when trained against either pure or mixture data. This is evident when training against both the "pure only" data set in Figure 5 and the "mixture only" data set in 6. Improvements in both pure and mixture training data for the other two (mixed) optimizations were also observed, which are shown in supporting information (Section 3.4.2,3.5.2).  ( ) (right panel) measurements in the "mixture only" training set, estimated using the initial parameters (OpenFF 1.0.0, blue points) and the final parameters after 12 iterations ("mixture only", orange points). RMSEs are categorized by chemical environment, where "Ether > Ketone" denotes a mixture with ether molecules in excess of ketone molecules, and "Ether ≈ Ketone" denotes a mixture with ether and ketone molecules in roughly equal compositions, etc. Error bars represent 95% confidence intervals computed by bootstrapping with replacement for 1000 iterations. The results from the other training sets containing mixture properties ("mixtures and pure density", "pure and mixture") show statistically equivalent improvements in training set RMSEs, and are available in Supporting Information Sections 3.4.2 and 3.5.2.
One notable exception is ketones, as pure ketone densities and "Ketone > Ether" binary densities were both degraded upon training. Given that this occurs for both pure and mixture training data, it is unlikely that it is a symptom of the training sets selected. We also note that ketone Δ RMSEs are improved, alongside both densities and Δ RMSEs for esters, which utilize the same [#8:1] generic carbon param-eter. It is likely that these properties are improved at the expense of ketone densities. By examining the first derivatives of the density contribution to the objective function with respect to the force field parameters, again partitioned by moiety (Figure 7), we see that modifying the [#1:1]-[#6X4] (hydrogen attached to tetravalent carbon), [#6X4] (tetravalent carbon), and [#8:1] (generic oxygen) has an opposite effect on ketone objectives compared to the objective for other moieties. This suggests that the force field lacks the degrees of freedom required to accurately capture carbons and hydrogens in ketone environments alongside the other environments represented by the same SMIRKS patterns. It is possible that including a more specific hydrogen or carbon parameter for this environment might improve prediction of ketone densities. Another possibility is that the LJ parameters are compensating for deficiencies in the AM1-BCC electrostatic model which was not optimized in this study. This result will be explored in further work as it is beyond the scope of the current study. However, analyses such as these point out how additional interaction types can be motivated by the large sets of data generated by this sort of study. . This suggests that adding a separate parameter (or parameters) to explicitly address ketone environments is likely to improve parameterization.

Overall Results
Each of the retrained force fields and the original OpenFF 1.0.0 force field was assessed against the larger test set. We see that all of the refit force fields improve RMSE against experiment, compared to the base OpenFF 1.0.0 force field. The Kendall and RMSE values for each force field against the test set are shown in Figure 8. Force fields reoptimized on all the different sets of observable data have statistically similar performance on and ( ), with highly accurate predictions of densities. Binary mixture densities in particular show significant improvement over the OpenFF 1.0.0 force field, with all refitted force fields achieving an average RMSE <=0.02 g/mL, compared to 0.025 g/mL for OpenFF 1.0.0. We note that pure and mixture densities are already highly accurate (RMSE ≤ 0.03 g/mL) when predicted with the OpenFF 1.0.0 force field, so improvements in enthalpic properties are likely more meaningful given that densities are already well predicted.
Force fields that include mixture data ("mixture only" = green, "mixtures + pure density" = red, "pure and mixture" = purple) performed best on Δ ( ), significantly outperforming the "pure only" (orange) force field. For Δ , the force fields trained against Δ ("pure only" and "pure and mixture") have the lowest RMSE, while the force fields trained against only mixture data perform similarly to the original force field. We also note that the correlation vs. experimental is improved slightly when training against the "mixture only" set, with an of 0.85 (95% CI 0.0.74,0.92), vs. the "pure only" set, with an of 0.80 (95% CI 0.67,0.89). While the Δ RMSE is higher when trained against the "mixture only" (9.95 kJ/mol, 95% CI (7.92,12.04)) set vs. the "pure only" training set (7.47 kJ/mol, 95% CI (5.47,9.45)); training against the "pure and mixture" set (containing both the pure and mixture data)) performs similarly on RMSE (7.51 kJ/mol, 95% CI (5.24,9.54)) for Δ , while maintaining the improvement on Δ ( ). However, the utility of improved Δ predictions is questionable for a force field intended to be used for biomolecular systems where vaporization does not typically occur.
These results indicate that mixture properties can replace physical properties of pure systems as a target for training LJ parameters, particularly in cases where more and more chemically diverse data is available for mixtures. Training against the "pure only" set does lead a significant improvement to Δ ( ) against the baseline; however, training directly against the "mixture only" set yields a much larger improvement. It appears that training against properties of mixtures alone sufficiently constrains the optimization, and includes enthalpic information that the traditional pure dataset alone does not. We also note that augmenting a traditional pure data training set with mixture data (such as the "pure and mixture" set) can improve treatment of mixture properties without degrading performance on pure properties.

Results by chemical environment
Notably, training against the mixture properties appears to have corrected a systematic error in the enthalpy of mixing, which training against pure properties alone is not able to correct. This is evident in the parity plots for Δ ( ), where a systematic underprediction of alcohol/ester (green points) and alcohol/ketone (orange points) mixture enthalpies is corrected (Figure 9). The improvement in the treatment of alcohol/ketone mixtures was achieved without directly including these mixture in the training set. ( ) data points for the test set, plotted for force fields optimized against the "mixture only" and "pure only" training sets, as well as the baseline OpenFF 1.0.0 (Parsley) force field. The systematic error in alcohol/ester and alcohol/ketone mixtures (highlighted green and orange points) is significantly reduced when training against the properties of mixture, but not when training against properties of pure systems. This is particularly significant as alcohol/ester and alcohol/ketone mixture enthalpies have strong positive deviations from ideality. Namely, ketone and esters are both hydrogen bond acceptors only and thus do not form hydrogen bonds in the pure phase. However when mixed with a hydrogen bond donor (an alcohol) they do. This is where mixture properties, and especially their ability to more readily capture complementary interactions, appear to be advantageous over pure properties.

Conclusions
Using our automatic data set selection and force field optimization workflow, we re-parameterized select LJ parameters of the OpenFF 1.0.0 force field against training sets containing combinations of pure ( , Δ ) and mixture ( ( ), Δ ( )) properties for alkanes, alcohols, esters, ethers, ketones, and acids. These training sets were controlled such that the same molecules are used in both pure and mixture training sets, to isolate the effect of the different data types used. Through iterative optimization of parameter sets, new force fields were produced that all exceeded the performance of the initial force field on the broader test set. Furthermore, we observe that training LJ parameters against mixture data constrains the optimization in a comparable or superior manner to optimizing with the traditional pure properties commonly used in LJ parameterization.
We have shown that training against mixture properties, specifically Δ ( ), is a compelling alternative for capturing enthalpic contributions to LJ interactions to Δ . Training against Δ is problematic due to limited data coverage and quality, as well as changes in molecular polarization between liquid and gas phase simulations. Mixture property datasets also offer expanded datasets by varying composition, and are more widely available in the ThermoML Archive. Moreover, we have shown here how mixture properties offer significant advantages over pure properties as an optimization target, especially in those cases of interactions which deviate strongly from ideality. These advantages lead to improved LJ parameters sets and better agreement with experiment. We also note that parameterizing against mixture properties alongside pure properties produces parameter sets with improved Δ ( ) and Δ . Given that we control for the identity of the molecules in the training set, this demonstrates that mixture properties contains information about LJ interactions that pure component property measurements do not.
While the parameter sets we demonstrate in this work improved both enthalpies of vaporization and enthalpies of mixing, in our view, improvements in the properties of mixtures are a better metric of force field improvement than pure or phase change properties for force fields intended for use in biomolecular simulations, since simulations typically take place in mixed aqueous or other liquid phases. The same interactions captured in enthalpies of mixing should also be informative for properties of pharmaceutical/biomolecular interest, such as binding affinities and solvation free energies. For this reason, optimization of LJ parameters against mixture property targets is planned to be the standard going forward for our OpenFF force fields. It is also important to note the scope of the study is limited to LJ parameters, and that other parameters, such as electrostatics, torsions, and 1-4 atomic scalings will impact the accuracy of these mixture properties. We anticipate that the automated property prediction in our parameterization workflow, along with the wider chemistry covered by the mixture properties in the ThermoML Archive, will lead to more accurate LJ parameters for general small molecule force fields.

Data and Code Availability
Scripts to run the simulations and reproduce the results in this study are available at https://github.com/ SimonBoothroyd/binary-mixture-publication.
The training and test data sets used in this publication are also available in this repository in .csv and .json formats.
To provide feedback on performance of the OpenFF force fields, we highly recommend using the issue tracker at http://github.com/openforceeld/openforceelds. For toolkit feedback, use http://github.com/ openforceeld/openforceeld . Alternatively, inquiries may be e-mailed to support@openforceeld.org, though responses to e-mails sent to this address may be delayed and GitHub issues receive higher priority. For information on getting started with OpenFF, please see the documentation linked at http://github.com/ openforceeld/openforceeld, and note the availability of several introductory examples.
V. Gerstner Young Investigator Award, and the Sloan Kettering Institute. A complete funding history for the Chodera lab can be found at http://choderalab.org/funding.