Improving small molecule force fields by identifying and characterizing small molecules with inconsistent parameters

Many molecular simulation methods use force fields to help model and simulate molecules and their behavior in various environments. Force fields are sets of functions and parameters used to calculate the potential energy of a chemical system as a function of the atomic coordinates. Despite the widespread use of force fields, their inadequacies are often thought to contribute to systematic errors in molecular simulations. Furthermore, different force fields tend to give varying results on the same systems with the same simulation settings. Here, we present a pipeline for comparing the geometries of small molecule conformers. We aimed to identify molecules or chemistries that are particularly informative for future force field development because they display inconsistencies between force fields. We applied our pipeline to a subset of the eMolecules database, and highlighted molecules that appear to be parameterized inconsistently across different force fields. We then identified over-represented functional groups in these molecule sets. The molecules and moieties identified by this pipeline may be particularly helpful for future force field parameterization.


Introduction
Molecular simulations are widely used in drug design, materials design, and in the study of biophysical processes. Large systems, like biomolecules or even small molecules in solution, prove to be computationally difficult to simulate at the quantum mechanical (QM) level of theory. For this reason, classical empirical potential energy functions known as force fields are often used in place of quantum mechanics in order to efficiently simulate chemical and biological systems. General small molecule force fields, such as the general AMBER force fields GAFF and GAFF2 [39][40][41], OPLS [17,23], CGenFF [36,37], and the Merck molecular force fields MMFF94 and MMFF94S [10][11][12][13][14][15][16], were built to model a wide variety of small organic molecules. These force fields are often fit to attempt to reproduce energies and geometries observed in QM calculations. However, when applied to new molecules, they have been observed to differ from both quantum mechanical calculations and from each other in predicted energies and optimized geometries for important areas of chemical space [3,7,27,33].
In the present study, we aimed to identify regions of chemical space where parameterization differences between force fields lead to different optimized geometries for small drug-like molecules in the gas phase. Geometric differences between force fields for some molecules would indicate that the underlying force fields describe the molecule differently, and thus are indicative of force field differences. Here, a subset of molecules from the eMolecules database [5] was used as a broad sample of small molecule chemical space. Five energy minimizations were performed on each molecule using one of five force fields: GAFF, GAFF2, MMFF94, MMFF94S, and the Open Force Field Initiative's SMIRNOFF99Frosst [27]. Two geometric measurements, Torsion Fingerprint Deviation [31] (TFD) and Tani-motoCombo [18], were used to better identify meaningful geometric differences that may suggest parameterization inconsistencies.
One key assumption in our work is that large geometric differences in optimized geometries tend, overall, to be indicative of substantial differences in the underlying force fields. In other words, we operate with the belief that differences in force fields which are substantial enough to result in large differences in optimized geometries are interesting to force field developers. This assumption does not mean that such force field differences are necessarily large; indeed, small force field differences can result in large differences in optimized geometries [6,27,33]. This is because many organic molecules have a large number of conformational minima often separated by relatively small barriers, so small force field differences may cause a molecule to optimize into different minima. Rather, we assume that force field differences which are large enough to substantially alter optimized geometries are of interest, even if the force field differences themselves are relatively small. All minimizations were performed with the same starting structure to ensure that differences observed are as attributable as possible to differences between force fields.
In part, our work is motivated by the Open Force Field Initiative (OpenFF), which seeks to develop open data sets and infrastructure which can be used to produce new force fields which improved accuracy. It recently released an initial prototype force field, SMIRNOFF99Frosst [27] and, given our connection with OpenFF, SMIRNOFF99Frosst is one focus of our testing in the present study.
By identifying particular functional groups or substructures that lead to drastically different geometrically optimized conformers, we will have identified a portion of chemical space that is inconsistently parameterized by the gamut of force fields studied, and thus is likely to be inaccurately described by at least some of these force fields. In the future, these molecules could be prioritized when training new force fields through inclusion in QM reference calculations or searches for new experimental data.

Results and discussion
In this study, we aimed to identify portions of small molecule chemical space which are particularly informative for force field development. After filtering eMolecules as described in section 'Molecules were sourced from the eMolecules onlinedatabase', we were left with 2.7 million molecules. We optimized each of these molecules with each of the five force fields considered-GAFF, GAFF2, MMFF94, MMFF94S, and SMIRNOFF99Frosst [10][11][12][13][14][15][16]27]. For any given molecule, we performed pairwise comparisons of these five minimized conformers, yielding ten comparisons that we here call "molecule pairs" (though each member of a molecule pair is actually the same molecule in different conformations). Each of the molecule pairs was evaluated for geometric differences using Torsion Fingerprint Deviation (TFD) [32] and TanimotoCombo [18]. We limited our analysis to molecules having 25 or fewer heavy atoms. Furthermore, we restricted our analysis to molecule pairs which yielded a TFD value less than 0.60 and a Tani-motoCombo value between 0.25 and 2.0. These cutoffs were chosen based on visual inspection, as explained in detail in section 'Methods'. Last, we sort molecules into different sets, which were then characterized using the Checkmol [8,9] functional group identification tool.
Here, we chose TFD and TanimotoCombo, rather than the more common RMSD, as key metrics for this analysis. The primary trouble with RMSD is that it is highly dependent on molecular size. For example, a value of 1.0 Å might correspond to a very large geometric difference for an extremely small molecule (e.g. butane) but a trivial geometric difference for a large, drug-like molecule (e.g. lipitor). Both TFD and TanimotoCombo are dimensionless numbers covering a well defined scale (TFD from 0 to 1; TanimotoCombo from 0 to 2) allowing us to define similarity and difference flags which are independent of molecular size. As described above, these metrics also track well with the qualitative structural differences we hope to identify in molecule pairs. While RMSD also captured some of these differences, its size dependence makes it impractical for surveying a wide variety of molecules.

Molecule pairs were flagged as similar or different based on TFD and TanimotoCombo
We used TanimotoCombo and TFD to identify molecules with dissimilar geometries to seek molecules with parameter inconsistencies. We assign a "difference flag" to a molecule pair (in a "molecule pair", the comparison is made across force fields) when it yields a TFD value over 0.20 and a TanimotoCombo value over 0.50. These pairs visually exhibit different minimized geometries that may be indicative of parameterization differences. Out of 26,984,560 possible molecule pairs involving any pair of force fields, the combination of the SMIRNOFF99Frosst and GAFF2 force fields yielded the largest number of difference flags (305,582, Table 1). This indicates that these force fields are quite different. In contrast, the combination of MMFF94 and MMFF94S yielded the smallest number of difference flags at 10,048 difference flags, indicating that these two force fields are the most similar among those being compared. These numbers are sensible given the history of these force fields-GAFF2 has undergone considerable recent reparameterization [39], and SMIRNOFF99Frosst inherits parameters from parm@Frosst [1], a sibling force field of GAFF, while reducing the number of parameters with an entirely different form of chemical perception [26,27]. In contrast, MMFF94 and MMFF94S are identical aside from their treatment of some nitrogen atoms [15]. Consequently their optimized conformers should be rather similar, as reflected in our scores. Thus, these results match what would be expected from the parameterization history of these force fields.
We also label molecule pairs with highly similar geometries. To do this, we assign "similarity flags" to molecule pairs that yielded TFD values under 0.18, indicative of similar geometries (Table 2). In order to visualize the number of molecule pairs with each flag, we plot TFD versus Tani-motoCombo for all molecule pairs in Fig. 1. We highlight regions flagged as similar and different along with regions outside the interest of this analysis. Figure 1 likewise shows that the vast majority of molecule pairs were rated similar by both TFD and TanimotoCombo.

Sets of molecules were created based on their similarity and difference flags
We then sort the molecules into sets of interest by their patterns of difference and similarity flags. As molecule pairs were formed from a set of five conformers, each resulting from optimization with a different force field, each molecule results in ten different molecule pairs which can be assigned either a difference or similarity flag. All molecules that yielded five or more difference flags out of ten were added to the set named "FivePlus." We also categorized molecules of particular interest for each force field. For each force field, we identified molecules in which two conditions held: (1) all molecule pairs involving that force field were flagged as different, and (2) the molecule pairs not including that force field were flagged as similar. Accordingly, molecules in these sets must result in four difference flags and six similarity flags; molecules in these sets can not also be in the FivePlus set. This allows us to highlight molecules which were treated differently by only one force field, potentially indicating problems in the force field's parameters for the represented chemistries of the molecule. We called this set the "Individually Different" set for that force field. For example, the molecules identified in this scheme for SMIRNOFF99Frosst were added   to the "Individually Different SMIRNOFF" ( ID SMIRNOFF ) set. This latter analysis is probably most relevant to the SMIRNOFF force field, as GAFF/GAFF2 and MMFF94/ MMFF94S come in families which would reduce the number of cases meeting these criteria if intra-family similarity is high-specifically, if both family members treat a molecule consistently, it will not be flagged as "individually different" for that force field. Our results after categorizing put 111,162 molecules into the FivePlus and 93,859 molecules in the ID SMIRNOFF set out of a total of 2,698,456 molecules. The ID SMIRNOFF set was the largest of the individually different force field sets, as is displayed in Table 3. As noted, we had some expectation SMIRNOFF might be relatively distinct from the other force fields considered.
Here, we focused on identifying molecules with significant geometric differences between force fields, and our sets were constructed to help identify these molecules, but other factors might also be important to examine in future work. For example, if different force fields lead to similar optimized geometries, that does not necessarily mean those force fields are similar. To examine whether energetics of the different force fields are similar, we would need to study the relative energetics of conformers of different molecules in different force fields, which is not something within the scope of this work as it would require multiple conformers per molecule. However, relative energetics have been examined in a separate study [24]. Here, then, we focus on identifying geometric differences which likely imply force field differences, though geometric similarities do not necessarily imply force field similarities.

Certain functional groups are more likely to appear in molecules with geometric differences
We characterized molecules with five or more difference flags Molecules which yielded five or more out of ten possible difference flags were separated into what we call our FivePlus set. This set contained 111,162 total molecules, comprising 4.62% of all molecules included in this analysis. Visualizations of selected molecule pairs from the FivePlus set displaying significant geometric differences are shown in Fig. 2.
We observed 150 Checkmol functional group descriptors with at least two occurrences within the FivePlus set. For each descriptor, we compared the proportion of FivePlus molecules with this descriptor to the proportion of molecules with this descriptor in the total set (Eq. 1), to assess whether any particular chemistries/functional groups tend to increase the likelihood of force fields treating molecules differently (and thus it ending up in the FivePlus set). We then identified the descriptors that are over-represented within the FivePlus set. For each of the descriptors we include in this section, we will provide an inline SMILES pattern for that descriptor along with the number of molecules with that descriptor in the current set of interest and the total set in the form (SMILES, number of molecules with the descriptor in the set of interest, number of molecules in total). The most under-represented descriptor in the FivePlus set was the ketene ([R]C([R])=C=O, 9, 2124), with an overrepresentation factor of 0.11. This suggests that most force fields describe geometries of ketenes consistently, possibly due to the ketene functional group's simple linear structure. We repeated this process with pairs of Checkmol descriptors to see whether particular combinations of descriptors are especially indicative of discrepancies. We observed 6,500 descriptor pairs occurring in at least two cases in the FivePlus set. As with singular descriptors, we compared Table 3 Number of molecules in each set of interest Shown are the number of molecules in each of six sets of interest (described in section Molecule pairs were flagged as similar or different based on TFD and TanimotoCombo); briefly, the FivePlus set contains molecules with substantially different geometries across multiple force fields, whereas the other sets contain molecules in which only the indicated force field yields a substantially different geometry from other force fields. The set with the largest number of molecules, the FivePlus set, contains 111,162 molecules out of the 2,698,457 molecules analyzed. No molecule can appear in more than one set of interest , which was over-represented in the FivePlus set by a factor of 24.28, but the number of molecules with this particular combination is so low it makes it hard to know how much weight to give this observation. We determined by visual inspection that the imidoyl halide and oxime functional groups were in close proximity in these molecules, such that they may form a conjugated system. The force fields inconsistently predicted planar groups within this larger system. Two other descriptor pairs were over-represented in the Five-Plus set by a factor greater than 19: Again, these combinations are rare, so conclusions must be tentative at best (Table 4).
Some pairs of descriptors are more likely to appear in the set of interest together more often than they are apart. We quantify this dependence by our pair enrichment factor (PEF) measurement (Eq. 2). The descriptor pair that showed the greatest degree of this dependence is quaternary ammonium salts paired with secondary aromatic amines ([R] 11,12), which yielded a pair enrichment factor of 2,807. Two other descriptor pairs yielded pair enrichment factors greater than 1,000: These findings display that heteroatoms, especially in delocalized pi-systems, are likely to lead to inconsistent optimized geometries. In particular, nitrogen, phosphorus, and sulfur atoms were found in all of the most over-represented descriptors and descriptor pairs. This is in line with our expectations, as QM treatments of sulfur and phosphorus are computationally expensive. Early force field development may have prioritized parameters for only the most common functional groups that involve sulfur and phosphorus. Our procedure has identified molecular fragments that yielded inconsistent geometries, and therefore can be improved upon in future force fields. Furthermore, nitrogen planarity errors are a known issue across force fields [15,27]. We therefore believe that the descriptors identified by this procedure may be informative for the creation/training of higher accuracy small molecule force fields. Molecules containing these fragments should be included in future force field training sets in order to create more accurate and general small molecule force fields. We characterized molecules where SMIRNOFF was individually different. The OpenFF Initiative seeks to improve force fields via a series of progressive improvements, thus we focus on the SMIRNOFF force field in particular in order to help our work with OpenFF. Specifically, we identify molecules where parameterization differences in SMIRNOFF relative to other force fields lead to geometry differences. Molecules that yielded four difference flags from combinations involving the SMIRNOFF-minimized conformer, and six similarity flags from combinations not including the SMIRNOFFminimized conformer, were likewise grouped into a set of interest. We refer to this set as the Individually Different SMIRNOFF ( ID SMIRNOFF ) set. This set contained 93,859 molecules in total, or 3.48% of all molecules included in this analysis. Visualizations of example molecule pairs from the ID SMIRNOFF set displaying geometric differences are shown in Fig. 3.
We observed 139 Checkmol descriptors in at least two molecules in the ID SMIRNOFF set. We compared the proportion of molecules exhibiting some descriptor within the ID SMIRNOFF set to the proportion of molecules exhibiting the descriptor in the total set (Eq. 1). We then identified descriptors that are over-represented or under-represented within the ID SMIRNOFF set ( Table 5). The most over-represented descriptor within the ID SMIRNOFF set was the azo compound descriptor ([R]/N=N/[R], 717, 1500) which was over-represented in the ID SMIRNOFF set by a factor of 13.74. Such compounds have been a focus of reparameterization efforts in more recent versions of SMIRNOFFbased force fields, in particular in OpenFF 1.1. [21,38], consistent with our observation here that these may be poorly treated. We discuss later OpenFF releases further below. Four other descriptors were over-represented in the ID SMIRNOFF set by a factor greater than 4: We observed 5805 descriptor pairs in at least two molecules in the ID SMIRNOFF set. As with singular descriptors, we compared the proportion of molecules displaying a descriptor pair in the ID SMIRNOFF set to the proportion of molecules displaying a descriptor pair in the total set (Eq. 1). These descriptor pairs and their over-representation factors are likewise included in Table 6. Six different descriptor pairs were tied as most over-represented in the ID SMIRNOFF set (Table 7). For these, all molecules displaying these pairs in the total set were also included in the ID SMIRNOFF set. For example, there were five molecules characterized as both ketene acetal derivatives and oximes ( 5,5), and all five of these molecules were also present in the ID SMIRNOFF set. We observed two other descriptor pairs which occurred in greater than 10 molecules in the ID SMIRNOFF set and had an over-representation factor greater than 20: We also calculated pair enrichment factors (PEFs), as described in Eq. 2, for the ID SMIRNOFF set of molecules. The descriptor pair that showed the greatest degree of this Descriptor pairs with high pair enrichment factors may suggest unique chemistries that lead to geometric inconsistencies that were not accurately described by single descriptors. Nitrogen atoms in conjugated systems make up a large portion of molecules that were optimized to unique structures by SMIRNOFF. While other force fields have likewise had problems with nitrogen planarity, our results display two Checkmol descriptors, azo compound and hydrazone, that are especially informative for SMIRNOFF. By visual inspection, molecules with one of these descriptors in between two aromatic rings are especially prominent, as can be seen in boxes 2, 3, and 4 of Fig. 3. QM calculations are necessary to determine if SMIRNOFF's minimized conformers were more or less accurate than other force fields (indeed, the data sets from this work are being used by OpenFF to do precisely these tests, and to help drive further force field optimizations [2,21,24,29]). Still, molecules like these will be useful in training sets of future force fields. In other cases, such as those displayed in boxes 5 and 6 of Fig. 3, SMIRNOFF disagrees with other force fields on the geometry of secondary carbon atoms in certain environments. SMIRNOFF assigns parameters to molecules separately by type (i.e. bonds, angles, and torsions are treated independently) with explicit treatment for bond order which differs from the atom-type approach used by the other force fields in this study [26]. It is possible this change in chemical perception can help account for the change in treatment of these systems. QM data on these molecules will be useful for future iterations of the SMIRNOFF force field, which are already in development. [2,21,24,29]

This work has been used to improve training datasets for the OpenFF Parsley series
In the present work, discrepancies between optimized geometries from different force fields highlight potential issues, but we have no ground truth or point of reference for sorting out which geometries are correct and which are not. This data simply helps us select molecules/chemistries which may be informative, and prioritize them for further study. Particularly, one might generate optimized geometries for these same molecules with QM calculations and then use these to help assess which force fields produce the best results, or use these in force field training sets to improve force field quality. Indeed, informative molecules from the present study are being used for precisely that purpose. Particularly, a subset of the FivePlus set was used as the basis for the "coverage" set used for the first OpenFF Parsley release, OpenFF 1.0 [29]. A larger portion was used in benchmarking OpenFF 1.0. Then, for OpenFF 1.2, training data was completely redesigned, in part drawing from what was called the "eMolecules Discrepancies Set" [22,25], corresponding to the first portion of the FivePlus set generated here. This training data redesign resulted in improved performance on a variety of benchmarks [21,24]. The relevant optimized geometries are freely available in QCArchive [34] as part of the OpenFF 1.2 training and benchmarking datasets.
While subsequent OpenFF work building on the data generated here is not formally part of this study, it does appear that molecules identified as potentially informative by this approach do serve well as input for QM calculations and force field training, at least when coupled with additional data selection and curation steps.

Methods
In order to help improve force fields, we sought to to identify where current force fields differ from one another. Here, we compared results of force fields (particularly, optimized geometries) after energy minimizing a large subset of the eMolecules database to identify sets of molecules for use in future force field parameterization.
Multiple force fields were used to minimize conformers We created input files for multiple force fields from a filtered eMolecules set (filtering described in section 'Molecules were sourced from the eMolecules online database'). We generated molecules from the SMILES strings as in eMolecules, adding explicit hydrogens and assigning default protonation states using the OpenEye toolkits. We did not enumerate protonation states or tautomers, and no significant effort was invested in selecting protonation states; we simply took the default states provided by the toolkit. We do not see this as a major limitation in a force field comparison since the resulting approach tests the force fields thoroughly on the molecules and protonation states used, even if that protonation state or tautomer will not be the most populated at neutral pH in solution.
Following construction of initial molecules, initial conformers were generated with OpenEye's Omega, then partial charges were assigned to molecules before minimization using the OpenEye implementation of AM1-BCC [19,20]. The input generation process yields one Tripos MOL2 file to be minimized directly with SMIRNOFF99frosst, MMFF94, and MMFF94S, as well as individual input coordinate and parameter topography files for use by GAFF (1.8) and GAFF2 (2.1). These force fields were chosen because they are widely used, easily available, and compatible with our workflow. Other force fields were either incompatible with our toolchain without substantial additional work, or were commercial and proprietary. For example, comparisons with CGenFF [36,37], OPLS-AA [23], or the Schrödinger OPLS series [17,30] would be of considerable interest, but these require substantially different toolchains, and the most recent Schrödinger force fields are also proprietary and require paying for a license.
We minimized each molecule using the parameters from each of the five aforementioned force fields, making sure to start all five minimizations from the same conformer. Minimizations with force fields other than MMFF were performed with OpenMM 7.0.1 [4] using the L-BFGS algorithm [28] with an energy tolerance of 5.0e−9 kJ/mol and a maximum of 1500 iterations. MMFF minimizations were performed with OpenEye's Szybki Toolkit [35,42]. Sample run files can be found in the Supporting Information. Molecules that did not successfully result in five minimized structures (one from each force field), were removed from analysis. For each molecule with five minimized structures, pairwise comparisons yielded a total of ten molecule pairs for geometric evaluation. We call these pairs of minimized conformers generated by different force fields "molecule pairs." Molecule pairs were assessed using Torsion Fingerprint Deviation and TanimotoCombo We then assessed each molecule pair for geometric differences. Molecule pairs were evaluated using two distinct measurements: Torsion Fingerprint Deviation (TFD) and TanimotoCombo. TFD is a method of measuring geometric differences between two conformers of the same molecule based on torsion angles. The TFD score between two structures represents a weighted sum of torsional differences as defined by Schulz-Gasch et al. [31]. Torsions central to the molecule are given more weight than torsions on the periphery of the molecule. Similarly with RMSD, geometric similarity is Fig. 2 Molecule pairs from the FivePlus set display visual geometric differences. The six molecules displayed here were identified from the FivePlus set using the over-represented descriptor and descriptor pair method described in Section 3.1 and thus are molecules where geometries differ substantially across force fields. Each panel shows a molecule (with the 2D structure shown as inset) and a pair of minimized conformers resulting from optimization with different force fields. These highlight geometric differences between minimized structures. While many structure pairs yield difference flags for molecules in the FivePlus set, only one structure pair is displayed for each molecule here. The lightly colored structure was optimized with GAFF, while the darkly colored structure was optimized with SMIRNOFF. (1) While GAFF predicts a planar structure of the ring system, SMIRNOFF predicts a buckled ring for this molecule with the disulfide descriptor. (2) GAFF predicts the imidoyl halide group to be nonplanar in this molecule with the imidoyl halide and oxime descriptors, while SMIRNOFF predicts it to be planar.
(3) SMIRNOFF predicts a larger bond angle between the amine and non-bridging oxygen than does GAFF in this molecule displaying the phosphoric acid amide descriptor. (4) This molecule displays both the quaternary ammonium cation and the secondary aromatic amine descriptors. While SMIRNOFF predicts a planar thiadiazolium ring, GAFF predicts it to be nonplanar. (5) While GAFF predicts the thiocarbamic acid halide fragment to be planar and perpendicular to the aromatic ring, SMIRNOFF predicts it to be nonplanar and off-perpendicular to the aromatic ring. (6) This molecule displays both the thioxohetarene and imine descriptors. While GAFF predicts a planar pyrroline ring, SMIRNOFF predicts this ring to be buckled.

3
inversely correlated with TFD score. TFD scores range from 0 to 1, with 0 being most similar and 1 being most different. The authors of TFD consider scores over 0.2 to represent significantly different geometries. In contrast to RMSD, TFD is bounded and less sensitive to molecular size, making it particularly helpful here.
TanimotoCombo, from OpenEye Scientific, is a normalized method of measuring geometric similarity between molecules. It is the sum of ShapeTanimoto, a measure of overall spatial overlap between two molecules, and Color-Tanimoto, a measure of spatial overlap of specific functional groups between two molecules, both of which are also metrics from OpenEye. TanimotoCombo values between two conformers range between 0 and 2 (it is the sum of two values each running from 0 to 1), with 2 being the most similar and 0 being the most different.
By visual inspection, we determined that Tanimoto-Combo is useful for recognizing cases where geometric differences are caused by particularly flexible moieties, such as single bond rotations in an alkyl chain. These differences can often be attributed to minor differences between force fields leading to flexible bond rotations, not to larger differences in force fields that result in more substantial geometric differences. Thus, here, we find that TanimotoCombo alone does not serve to help us isolate geometry differences that are likely due to substantial force field differences; instead, low TanimotoCombo values can result from simple bond rotations that result from molecules energy minimizing to different local minima that we do not consider particularly interesting by visual inspection. However, TanimotoCombo in conjunction with TFD can be used to identify geometric differences that suggest underlying inconsistencies in parameterization.
Molecule pairs were flagged as similar or different based on TFD and TanimotoCombo We identified molecule pairs displaying parameterization differences which led to different geometries using TFD and TanimotoCombo. TFD is sensitive to ring deformations, torsional differences, and atom planarity changes, which makes it useful for recognizing differences in parameterization. TanimotoCombo, with greater sensitivity to coordinate differences caused by conformational flexibility in a molecule, is more useful for removing cases that are less likely to be caused by parameterization differences, such as different rotameric states.
We chose cutoffs to identify molecule pairs displaying parameterization differences (flagged "different") and pairs displaying no parameterization differences (flagged "similar"). TFD values below 0.20 are believed to be pharmacologically similar [31], so we chose a TFD value greater than 0.20 to label molecule pairs as different. After visual inspection of a variety of molecules, we observed that molecule pairs with a TanimotoCombo under 0.5 typically had changes due to single bond rotations. Because such bond rotations can arise from a variety of reasons aside from substantial differences in parameterization, we did not wish to focus on such cases. Thus, molecule pairs with a TFD value greater than 0.20 as well as a TanimotoCombo value greater than 0.50 were flagged as different -allowing us to focus on cases with substantial torsional differences which were not simply due to rotations around highly flexible bonds. We used a substantial amount of manual inspection of these thresholds to help us make these choices. As a result of these choices, any pair of molecules with a TFD value of 0.18 or less was assigned a similarity flag, as it will display geometrically similar structures. We left a small buffer region between 0.18 and 0.2 when defining similarity flags in order to avoid an extreme sensitivity to small changes around the 0.20 cutoff (Fig. 4).
Molecule pairs that yielded very high TFD or very low TanimotoCombo values were also determined to often be uninformative. Tagging these molecule pairs as "different" would be unhelpful because the differences are not due to substantial changes in force field parameters.

We created and characterized sets of interest
Molecules can be sorted into sets of interest by considering the combinations of their difference and similarity flags. A single molecule in this pipeline is associated with five minimized structures. Pairwise combinations of these structures will yield ten molecule pairs and thus up to ten flags. Molecules that yielded a large number of difference flags, regardless of the force fields of origin, are of particular interest for force field parameterization. Specifically, we set aside molecules with five or more difference flags for further analysis, we call this our FivePlus set. The other sets of interest are based on the origin of the difference flags with the goal of identifying molecules which behave differently with one force field than all the others. For a molecule to be considered different with that one force field, all four molecule pairs involving that force field should be flagged as different, and all other molecule pairs need to be flagged as similar. We call these Individually Different Sets for each force field, i.e. for SMIRNOFF we create the SMIRNOFF Individually Different set labeled by ID SMIRNOFF . A molecule in the ID SMIRNOFF set would have 4 difference flags, one for each pair involving SMIRNOFF, and six similarity flags for all other force field combinations.

Sets of interest were analyzed by the frequencies of the functional groups
Identifying functional groups which are more prevalent in our sets of interest could be informative for future force field parameterization. To this end, we used Checkmol [8] to describe the combination of functional groups in each molecule. When given a molecule, Checkmol provides a list of descriptors for the functional groups it contains. For each descriptor, we count the number of affiliated molecules in each set of interest as well as in the entire molecule set. From there, we can determine the most over-represented descriptors in each set of interest. We only considered descriptors and descriptor pairs that appeared at least twice in our full molecule set.
We compute the over-representation factor describing how over-represented a particular descriptor is in a given set by dividing the frequency of the descriptor in the set by the frequency of the descriptor in the full molecule set. Mathematically, we can write where N A,set is the number of molecules containing descriptor A in a particular set, N mols,set is the number of molecules in that particular set, N A,total is the number of molecules in total with descriptor A, and N mols,total is the number of molecules in total.
Force field behavior could change with combinations of functional groups, and thus we repeated this calculation with pairs of Checkmol descriptors. We can apply Eq. 1 to analyze pairs of descriptors by replacing A with A + B to represent molecules containing both descriptors. However, with pairs of descriptors, we are more interested in whether the combination of the descriptors is important. For example, if both descriptors A and B are highly probably in a set of molecules, then finding the combination in that set at a higher frequency is not particularly interesting. Thus, we try to determine if the descriptor pair is more likely to show up in a set of interest than the individual descriptors separately. To that end we calculate an enrichment factor given by (2) p A+B p A ⋅ p B Fig. 4 Molecule pairs with low TanimotoCombo and low TFD scores are often uninformative. Here, we show an example of a molecule pair that does not seem informative for force field parameterization. The lightly colored molecule was minimized with GAFF, while the darker molecule was minimized with SMIRNOFF. The two minimized structures display little geometric differences outside of the orientation of substituents around the sulfonamide group; most of the geometric difference appears due to the rotation of a single torsion. The low TFD value of 0.046 implies that these structures are highly similar by TFD, while the low TanimotoCombo value of 0.27 implies that these structures are starkly different by TanimotoCombo. By visual inspection of this molecule and others, we determined that molecule pairs with low Tanimoto Combo and low TFD scores were often not as informative, at least with respect to our goals in this project where p A+B denotes the observed frequency (probability) of a molecule with the combined A and B descriptors being found in the set of interest, and p A and p B denote the individual frequencies for descriptors A and B in the same set of interest. For example, p A is given by A larger enrichment factor indicates that the combination of descriptors A and B is more likely to occur in a set of interest than those descriptors individually. Descriptor pairs with a larger enrichment factor should be considered as important for future parameterization because the combination of functional groups changes a force field's behavior.

Molecules were sourced from the eMolecules online database
Approximately 8.1 million molecules were initially sourced from the eMolecules database as SDF files (version obtained in September 2016) [5]. Molecules from this set were then filtered by several criteria. We removed all molecules that contained any metal or metalloid atoms, were over 200 heavy atoms, or had a nonphysical valence (such as a pentavalent carbon atom). Molecules which failed at any step of the process were also removed, i.e. could not be parameterized by one of the force fields. While we minimized all these molecules with each force field, very large molecules are impractical for visual inspection or future QM calculations. Thus, we filtered the molecules for analysis here to remove molecules with more than 25 heavy atoms.

Conclusions
Here, we sought to determine informative molecules for force field parameterization. We assume that conformational differences in molecules minimized with different force fields indicates those molecules ought to receive additional attention in future force field parameterization.
Thus, we energy minimized a large portion of eMolecules with various force fields, and cross-compared the resulting optimized geometries based on TFD and TanimotoCombo metrics. We chose cutoffs for each of these metrics in order to prioritize conformational differences likely due to changes in force field parameters.
Our analysis flags molecules for further analysis in several ways. First, we single out molecules that differ in treatment across many force fields as molecules which are likely to be particularly informative in general. Second, we can separate out molecules which are treated differently by only one force field as perhaps indicative of problems with that force field in particular. We can further break down informative molecules by looking at representation of functional groups and pairs of functional groups, to identify those that are over-represented among informative molecules, perhaps indicating these functional groups require additional attention in force field parameterization. The descriptors which were over-represented in the FivePlus set could be informative for understanding the limitations of current force field parameterization procedures. All general small molecule force fields currently available depend on human determined typing rules-atom types in most force fields and the SMARTS patterns used in SMIRNOFF-based force fields. The differences in geometries around heteroatoms, especially sulfur and phosphorous, point to the potential bias of the scientists parameterizing each force field. Most of the time new parameter typing rules are added to force fields out of necessity and each group will prioritize different chemistry. Including typing rules in automatic force field parameterization should help reduce this bias since typing rules would be driven by training data rather than human choices.
Finding the more accurate conformation in each molecule pair would require performing a quantum mechanical geometry optimization (QM). QM calculations are significantly more expensive than simple force field optimizations. Our protocol allowed us to explore a greater molecular space, and we analyzed 26,984,560 molecule pairs. Our approach has identified regions of chemical space where force field parameterization is currently inconsistent. Our approach and results have identified descriptor and descriptor pairs which are different for each individual force field. Molecules with these descriptors may be prioritized for future parameterization leading to more accurate force fields overall. Some work along these lines is already in progress [21,22,25,29]. mized geometries of 265,847 molecules with four or more difference flags. An archived copy of the GitHub repository is provided in the electronic Supporting Information associated with this paper.

Disclaimers
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Disclosures DLM is a member of the Scientific Advisory Board of OpenEye Scientific Software and an Open Science Fellow with Silicon Therapeutics.