Development and Validation of the QUBE Protein Force Field

Molecular mechanics force ﬁeld parameters for macromolecules, such as proteins, are traditionally ﬁt to reproduce experimental properties of small molecules, and thus they neglect system-speciﬁc polarization. In this paper, we introduce a complete protein force ﬁeld that is designed to be compatible with the QUantum mechanical BEspoke (QUBE) force ﬁeld by deriving non-bonded parameters directly from the electron density of the speciﬁc protein under study. The main backbone and sidechain protein torsional parameters are re-derived in this work by ﬁtting to quantum mechanical dihedral scans for compatibility with QUBE non-bonded parameters. Software is provided for the preparation of QUBE input ﬁles. The accuracy of the new force ﬁeld, and the derived torsional parameters, are tested by comparing the conformational preferences of a range of peptides and proteins with experimental measurements. Accurate backbone and sidechain conformations are obtained in molecular dynamics simulations of dipeptides, with NMR J coupling errors comparable to the widely-used OPLS force ﬁeld. In simulations of ﬁve folded proteins, the secondary structure is generally retained and the NMR J coupling errors are similar to standard transferable force ﬁelds, although some loss of the experimental structure is observed in certain regions of the proteins. With several avenues for further development, the use of system-speciﬁc non-bonded force ﬁeld parameters is a promising approach for next-generation simulations of biological molecules.


Introduction
Molecular mechanics (MM) force fields for biomolecular simulations have been under continuous development for many years. [1][2][3][4][5] In traditional transferable force fields, every atom in a molecule is assigned a type based on its atomic number, bonding and local chemical environment. The atom type then dictates the parameters that are used to model that atom's interactions. 3 The force field parameters for each atom type are stored as a library, which is built by carefully reproducing the experimental or quantum mechanical properties of a benchmark set of small molecules. [2][3][4][5][6][7] Due to the infeasibility of accurately parameterizing all of chemical space, a balance must be made between the size of the library and potential inaccuracy due to transferring parameters to molecules outside the fitting set. In many cases, it is acknowledged that transferable force fields are not sufficiently accurate. 8 When building force fields for small molecules, the atomic charges are usually assigned in a system-specific or "bespoke" manner, using methods such as RESP, CM1, or AM1-BCC. [9][10][11][12][13] This is because it is well-known that atomic charges polarize in response to their chemical environment (for example, the presence of electron donating or withdrawing groups). 8 Bespoke charges are usually assumed to be compatible with the fixed Lennard-Jones parameters of the force field, although these themselves have also been shown to be dependent on the local environment of the atom. 14 Although proteins must also experience polarization effects in both the charges and Lennard-Jones parameters, protein force field parameters have always, to date, been assigned from a transferable library. [1][2][3]15 This leads to an inconsistency in the parametrization strategy used for protein force fields and bespoke small molecule force fields. This is potentially problematic when studying properties that depend on the electrostatic potentials of proteins, such as their interactions with small molecules, and there is no clear way around this using traditional force field fitting methods.
To improve the consistency between charge and Lennard-Jones parameters, and also reduce the reliance on fitting to experimental data, one could either directly fit non-bonded MM parameters to reproduce quantum mechanical (QM) energies and forces, [16][17][18][19] or derive the non-bonded parameters of the force field directly from QM. In the latter approach, the QM interaction energy may be broken down into physically motivated components using intermolecular perturbation theory, [20][21][22] though these methods are limited to quite small system sizes. Encouragingly, Grimme's quantum mechanically derived force field (QMDFF) method is capable of outputting bespoke non-bonded force field parameters for molecules comprising more than 100 atoms. 23 Despite using fixed point charges, with no explicit polarization term, the bespoke force field reproduces both QM inter-and intramolecular energies to an accuracy of around 1 kcal/mol for small molecule benchmarks.
In recent years, we have been following a similar strategy to Grimme's QMDFF, focusing more on condensed phase properties and heterogeneous systems. 24,25 The basis of this approach is the density derived electrostatic and chemical (DDEC) atoms-in-molecule (AIM) scheme, 26,27 which partitions the total electron density into approximately-spherical atomcentered basins. Atomic charges are derived by integrating the atomic electron density over all space and, in contrast to direct fitting of the QM electrostatic potential (ESP charges), it is possible to derive chemically-meaningful DDEC atomic electron densities and charges for both surface and buried atoms. 28 A further advantage of this approach is that the Lennard-Jones parameters may also be computed directly from the atomic electron densities, using methods based on the Tkatcheko-Scheffler relations that are commonly used to incorporate dispersion effects into density functional theory (DFT) calculations. 14,24 Similar to the Grimme approach, these non-bonded parameters are derived from a single QM optimized structure, which would be problematic if the charges show strong conformation dependence.
However, Manz and Sholl have demonstrated that DDEC charges are transferable between different conformations of a molecule (as measured by their ability to recreate the QM electrostatic potential), and they conclude that the charges are suitable for the construction of flexible force fields. 27 Furthermore, it should be noted that atoms-in-molecule electron density partitioning lends itself naturally to the derivation of both off-site charges to model electron anisotropy (such as lone pairs and σ-holes) 25 and atomic polarizabilities, 29 though we have not yet investigated a fully polarizable force field.
In keeping with our goal of deriving force field parameters directly from QM, rather than fitting to experiment, we have supplemented the atoms-in-molecule non-bonded parameters with harmonic bond and angle parameters derived directly from the QM Hessian matrix. 30 There are a number of methods available for deriving bonded parameters from the QM Hessian matrix, 23,31 but our recent adaptation of the Seminario method 32 (which we name the modified Seminario method) is conceptually quite straightforward whilst yielding parameters that reproduce QM vibrational frequencies with a mean unsigned error of 49 cm −1 , below that of OPLS (59 cm −1 ). Collectively, we have named these methods the QUantum mechanical BEspoke (QUBE) force field. This name reflects the fact that force field parameters are derived by the user specifically for the small molecule under study, directly from QM calculations. We have released a software toolkit (QUBEKit) that facilitates the derivation of small organic molecule force field parameters, and also allows the user to derive the positions of off-site charges to model anisotropic electron density and to fit dihedral parameters to QM torsion scans. 25 QUBE force fields have been derived for 109 small organic molecules, and yield mean unsigned errors of 0.024 g/cm 3 , 0.79 kcal/mol and 1.17 kcal/mol in computed liquid density, heat of vaporization and free energy of hydration. 25 These results are competitive with standard transferable force fields, which have been extensively fit to properties such as these.
To achieve our goal of employing the QUBE force field in computer-aided drug design applications, we require a compatible protein force field. Since the non-bonded parametrization strategy employed in QUBE is very different to that used in the standard biomolecular force fields (e.g. AMBER, OPLS, CHARMM), there is no reason to believe that they are compatible. However, by implementing the atoms-in-molecule non-bonded parameter derivation methods in the ONETEP linear-scaling density functional theory (DFT) software, 33 we have shown that it is feasible to derive these charges and Lennard-Jones parameters for entire proteins. 24 In this way, the number of fitting parameters is substantially reduced, and we have a consistent parametrization approach that can be applied to both small and large molecules, including entire biomolecular assemblies. Since, in this approach, all non-bonded parameters are derived from a single QM calculation, both the charge and Lennard-Jones parameters naturally include the native state polarization effects of the environment. Importantly, we have shown that protein charges derived using DDEC electron density partitioning recreate the underlying QM electrostatic potential with high accuracy, and that charges derived for a NMR ensemble of BPTI protein structures are not too conformation-dependent (standard deviations per residue less than 0.04 e). 34 This is in contrast to the performance of RESP charges, which have been shown to be significantly more conformation-dependent for an ensemble of polypeptide structures. 35 Additional simulations demonstrated the feasibility and advantages of deriving bespoke parameters for a protein-ligand complex. The computed relative binding free energy of indole and benzofuran to the lysozyme protein using the environment-specific force fields (−0.4 kcal/mol) was in excellent agreement with experiment (−0.6 kcal/mol), and was substantially more accurate than standard force fields (−2.4 kcal/mol). However, the force field was in need of further development as the bespoke non-bonded terms were used in combination with standard OPLS-AA bonded parameters.
The use of these parameters potentially limits the accuracy of the force field due to interdependency between the bonded terms, particularly the torsional parameters, and the non-bonded components of the force field. This issue is the subject of the current paper.
In standard transferable force fields, the torsional component is typically parameterized using QM dihedral energy scans, with the difference between analogous MM and QM energy scans minimized by fitting the torsional parameters. 1 Reparameterization of the torsional terms has been shown to be a crucial step in improving the accuracy of force fields and this has recently been demonstrated for AMBER ff15ipq, CHARMM36 and OPLS-AA/M. 1,2,15 Bond and angle reparameterization has also been shown to be an essential stage in improving the accuracy of biomolecular force fields, 2,36 although it is not so frequently carried out.
Since it is not currently feasible to derive accurate QM Hessian matrices for entire proteins, we have used the modified Seminario method to compute a complete set of bond and angle parameters for the twenty naturally occurring amino acids. 30 This work focuses on the remaining component of the force field, namely the re-fitting of key torsional parameters that describe the backbone and sidechain dynamics of an amino acid. The methods and validation tests broadly follow the approaches employed in the development of OPLS-AA/M, the latest OPLS force field. 1 Torsional parameters are fit by minimizing the differences between multiple QM and MM potential energy scans of dipeptide backbone and sidechain dihedral angles. Our overall goal is to test the extent to which bespoke non-bonded parameters may be combined with libraries of bonded parameters to produce a protein force field that is compatible with our QUBE small molecule force field for use in computer-aided drug design applications.
The performance of the QUBE protein force field is tested through comparisons between experiment and molecular dynamics (MD) simulations for a set of twenty dipeptides, the glycine tripeptide and alanine pentapeptide, and a range of small folded proteins.
This benchmark testset is similar to those used in the development of protein force fields such as AMBER ff15ipq, AMOEBA, CHARMM36 and OPLS-AA/M. 1,2,5,15 As we shall show, the QUBE protein force field is competitive with standard transferable force fields for the dipeptide set and alanine pentapeptide, while retaining the experimental structures of small folded proteins reasonably well. To encourage further testing of the QUBE protein force field, MD input files for the molecules studied, as well as the necessary scripts to convert the QM electron density to QUBE force field format have been made available (https://github.com/cole-group/QUBEMAKER). Finally, in the conclusions, we outline a roadmap for future improvements to QUBE.

Theory
The functional form of the standard biomolecular force field has five components. Covalent interactions between atoms are modelled using harmonic bond stretching and angle bending parameters, while rotations about a bond are described by anharmonic 4-body torsional terms. Non-bonded interactions are described by a sum of Coulombic interactions between (usually) atom-centered point charges and a physically-motivated Lennard-Jones interaction, which combines a short-range repulsive r −12 potential with a longer-range attractive r −6 interaction. We now provide an overview of how these various components are parameterized in the QUBE protein force field, and contrast the approaches to those used in standard transferable force fields. Since the methods used to parameterize the non-bonded, and bond and angle terms have been extensively described elsewhere, 24,25,28,30 we focus here on the derivation of the torsional parameters.

Non-bonded Parameters
The non-bonded components of a molecular mechanics force field aim to describe the quantum mechanical electrostatic, dispersion and exchange-repulsion interactions in a computationally efficient manner. 37 The charge parameters are generally fit to the quantum mechanical electrostatic potential of small molecules. The Lennard-Jones parameters are then fit to reproduce experimental data, such as liquid densities and heats of vaporization. 2,6,38 The aim of QUBE is to move away from the requirement for transferable force field parameters, and instead to derive bespoke parameters for molecules directly from QM calculations. First, a QM simulation of the molecule under study is performed. From the output of the QM calculation, the total electron density of the molecule is partitioned onto individual atoms using an atoms-in-molecule (AIM) weighting scheme. There is no unique method to perform this partitioning, but we favor the density derived electrostatic and chemical (DDEC) scheme, 26,27 which is a weighted combination of the iterative stockholder atoms and iterative Hirshfeld approaches. 39,40 With the electron density partitioned to individual atoms, the atom-centered charges can be simply found by integrating the electron density over all space (and adding the nuclear charge). We have implemented the DDEC approach in the ONETEP software package, which allows us to perform QM calculations of, and assign parameters to, systems comprising thousands of atoms. The derived charges have been shown to be suitable for use in flexible force field design in multiple works, because they are able to reproduce the underlying QM electrostatic dependence while exhibiting low conformation-dependence. 24,27,34 The charges are specific to the system under study and, by performing the QM calculation in the presence of an implicit solvent model, polarization of the charges in the condensed phase can be included in the model. 24 The dispersion and exchange-repulsion interactions are described using a Lennard-Jones potential with a form: The dispersion coefficient, B ij , is calculated from the partitioned electron density by using the Tkatchenko-Scheffler relationship to rescale the free atom dispersion coefficient by the computed volume of the atom in the molecule. 14,24 The standard combination rule, can then be used to determine heteroatomic dispersion coefficients. The A ij parameter, which describes the short-range repulsion between overlapping electron clouds, cannot be readily calculated directly from the electron density. Instead it is computed by requiring that the minimum in the interatomic Lennard-Jones potential coincides with the estimated van der Waals radius of the atom in the molecule. 24 This non-bonded parameter derivation scheme requires just one fitting parameter per element (corresponding to the van der Waals radius of the free atom in vacuum).

Bond and Angle Parameters
The parameterization approach used to determine bond and angle harmonic force constants in traditional force fields, such as OPLS and AMBER, is to fit them to reproduce QM data or experimental normal mode frequencies. This creates interdependencies in the force field parameters. That is, bond and angle parameters depend on the non-bonded and torsional parameters used during the fitting process, and therefore they cannot be easily transferred to the QUBE force field. Instead, we derive bond and angle force constants directly from the QM Hessian matrix of the molecule under study, while equilibrium bond lengths and angles are taken from the optimized geometry of the molecule. 30 This method is a modification of the Seminario method, 32 and is based on the computation of the eigenvalues and eigenvectors of the partial Hessian matrix, k AB : The harmonic bond force constants are given by: whereû AB is a vector in the direction of the bond AB, and ν AB Similar methods may be used to derive the angle force constants, and we introduce a correction to the standard Seminario method which takes into account the geometry of the molecule under study. 30 Consistent improvements in the computed normal modes were demonstrated using this modified Seminario method for a range of molecules. In particular, QM vibrational frequencies for a set of dipeptides were reproduced with an accuracy of 40 cm −1 , which compares favorably with the OPLS force field (47 cm −1 ) and the original Seminario method (104 cm −1 ). Since the derived bond and angle parameters do not depend on the choice of non-bonded and torsion parameters, the derived parameters are suitable for use in the QUBE protein force field. Since large-scale polarization effects are expected to be significantly less important for bond and angle parameters than for charges, we use the library of bonded parameters provided previously 30 (and the same atom types as those used for OPLS-AA/M 1 ).
Whilst preparing the protein simulations, it was found that our library was missing parameters for the disulfide bridge between pairs of cysteine residues. These bond and angle parameters were therefore derived using the QM Hessian matrix of dimethyl disulfide and are supplied in Section S2.3 of the Supporting Information.

Torsional Parameters
. X γ and Y δ (not shown) are the neighboring heavy atoms in the side chains (if present).
With new bond, angle and non-bonded parameters derived, all that remains to complete the QUBE protein force field is to obtain the torsional parameters. Unfortunately, it is infeasible to derive torsional parameters from QM simulations that are specific to each protein.
Therefore, we make the assumption that the derived parameters for dipeptides are transferable to proteins, and in Section 4 we test the limitations of this assumption by validating the force field against experimental peptide and protein dynamical observables.
The torsional terms in a force field are a function of the dihedral angles (φ) in a molecule.
Here, we use the same functional form as the OPLS force field: where the sum runs over all dihedrals (k) in the molecule and V k 1−4 are parameters to be fit. We focus on reparameterizing the backbone (φ, φ , ψ and ψ ) and sidechain (χ 1 , χ 1 , χ 2 and χ 2 ) torsional parameters ( Figure 1). Re-fitting of the remaining torsional and improper parameters are beyond the scope of the current work, and these parameters are instead taken from the OPLS-AA/M force field. 1 However, parallel efforts are being made to develop a toolkit for automated parameterization of small molecules using the QUBE force field, which will facilitate derivation of the remaining parameters in future. 25 When fitting torsional parameters, the main objective is to minimize the difference between MM and QM gas phase dihedral energy scans. However, weighting schemes and regularization can also be used to change the form of the error function that is minimized.
Regularization is a technique generally used to prevent overfitting to data. There are multiple forms of regularization that can be applied to improve fitting. 2,25 In this work we use a harmonic restraint that is added to the error function. 2 This penalty term ensures that torsional parameters do not deviate significantly from their initial value unless a significant improvement in the agreement with the QM energy surface is observed. As we will show, even with low levels of regularization (a small λ value), a sizeable increase in performance is observed for the QUBE force field. The general form of the error function used in this study is given by: where k B is the Boltzmann constant, T is a weighting temperature, n is the number of points at which the energy is evaluated, W j is the contribution from the weighting scheme, λ is the regularization coefficient (this term is independent of n and can take any positive value), V i is the torsion parameter being optimized and V 0 i is an initial estimate of the torsion parameter. Where we have used a harmonic restraint, we have used V 0 i = 0 as the initial guess as previously suggested. 41 The V 4 term was set to zero throughout the fitting procedure to avoid overfitting. 1  3 Methods

Torsional Parameter Fitting
Torsional parameter fitting followed the general strategy employed in the development of the OPLS-AA/M force field, 1 amongst others, in which parameters are fit to reproduce QM gas phase potential energy surfaces. Fitting and validation was performed using dipeptides of the form (Ace-X-NMe), where X is the amino acid, Ace is an acetyl group and NMe is the N-methyl group.

QUBE Parameter Derivation
The ground state electron densities of the dipeptides were computed using the ONETEP linear-scaling DFT code 33 with the PBE exchange correlation functional and standard parameter settings, (Supporting Information S2.2). 24 Since the reference QM potential energy scans are performed in the gas phase, we have decided to derive QUBE force field charges and Lennard-Jones parameters from the vacuum electron density (rather than in an implicit solvent). The assumption here is that the required correction to the MM potential energy surface is approximately the same in the gas and condensed phases.
Charge and Lennard-Jones parameters were derived from the QM ground state electron density using the DDEC scheme 26,27 as implemented in the ONETEP code 28,34 (Section 2.1). As discussed previously, DDEC charges show low, but non-zero, conformational dependence. 28 To account for this, the non-bonded parameters were derived for multiple con- A 2D scan of φ (C-N-C α -C) and ψ (N-C α -C-N) was carried out for alanine and glycine.
The sidechain energy scans follow the same methods as the backbone scans, except that the single point B2PLYP calculation was not performed for all χ 2 scans, as ωB97X-D was shown to give sufficiently accurate results. 1 These 1D scans give the energy as a function of . The ψ and φ angles were constrained to an α-helical In this work, an additional 2D scan of the φ and ψ dihedral angles of serine was found to be necessary for accurate torsional parameters for serine and threonine, which both have a polar oxygen atom at the X γ position. This followed similar protocols to those previously described, however the sidechain χ 1 angle now had to be taken into account. Scans were performed at 30 • increments of φ/ψ, for χ 1 initially set to -60 • , 60 • and 180 • . This gave three 2D energy scans for the main rotamers of serine. The minimum energy structure for each φ/ψ angle was then used to construct the overall minimum φ/ψ potential energy surface (Section S3.1).

Fitting Dipeptide Torsional Parameters
Torsional parameters were optimized by minimizing the error function shown in eq 5 using a steepest descent algorithm. MM potential energy surfaces were computed by scanning dihedral angles in 15 • increments using the BOSS software. 43 The backbone torsional parameters for all dipeptides tested, excluding serine and threonine, were fit to the alanine and glycine scans previously described. The total error for the two scans was given by: with the prefactors corresponding to the relative frequency of each amino acid in the human proteome. Preliminary testing (Section S1.1) showed that a weighting function and regularization did not significantly improve the conformations sampled during the dipeptide MD simulations and so were not used (λ = 0, W = 0).
The remaining dipeptides, threonine and serine (both of which contain aliphatic hydroxyl groups in their sidechain), were assigned identical backbone parameters that were fit to reproduce the QM scans of serine. For these scans, regularization and weighting were shown to be necessary to produce dipeptide dynamics which were in agreement with experiment.
An investigation of how the simulation error changes with regularization is given in Section S1.2. The harmonic restraint parameter was set to λ = 0.05.
The sidechain scans for all dipeptides followed the same fitting process as the alanine/glycine backbone with no weighting or regularization used. As atom-typing is not used for the non-bonded parameter assignment in the QUBE force field, each set of sidechain torsional parameters is also residue-specific. This differs from the approach used in OPLS-AA/M in which sidechain torsional parameters with the same set of atom types are generally assigned the same parameters.
In a number of cases it was necessary to make manual changes to the fitting process.
This was restricted to setting a number of torsion parameters to zero (that is, λ = ∞) and reducing the number of scans used in the fitting process. In particular, the aspartic acid ψ/φ distribution was improved by setting the χ 2 torsional parameters to zero. Additionally, using only the QM energy scan with the lowest minimum energy in the fitting process was shown to result in an improvement in the MD simulations of the dipeptides for the χ 1 torsional parameters of cysteine, methionine, serine and threonine. The need for manual input in the fitting process was also required for developing OPLS-AA/M and is likely due to the restrictive functional form of the torsional potential and the conformational dependence of the energy scans. 1 The full set of manual changes involved is listed, along with the final torsional parameters, in Section S3.3.

Alanine Pentapeptide and Glycine Tripeptide
As the non-bonded parameters used are specific to the system under study, they are not the same for an alanine dipeptide molecule as for the alanine pentapeptide (Ala 5 ). The alanine residue in the dipeptide is blocked by acetyl and N-methyl groups whereas the central three alanine residues in Ala 5 have neighboring alanine residues on both sides. Therefore, varying environments exist for alanine residues in the different molecules. Consequently, the parameters found for the alanine dipeptide were found to be unsuitable for MD simulations of Ala 5 (Table S5). However, the use of a harmonic restraint in the fitting process resulted in torsional parameters that were sufficiently accurate for the alanine and glycine peptide simulations. Alanine and glycine backbone torsional parameters were refit to the QM energy scans with λ = 0.50, no weighting was used. The optimal value used for the regularization parameter was found by minimizing the differences between simulated and experimental NMR observables for Ala 5 (Table S5). We note that the J coupling error is not too sensitive to the strength of the harmonic restraint. Separate torsional parameters are used for the alanine pentapeptide and glycine tripeptide (Gly 3 ), as residue-specific parameters should result in a more accurate force field.

Protein Torsional Parameters
It is expected that optimal backbone torsion parameters for protein simulations are more similar to those developed for Ala 5 and Gly 3 than for the set of dipeptides. We therefore use a regularization λ = 0.50 for all protein backbone torsional parameter derivation.
Alanine, glycine, serine, and proline torsional parameters are fit to available QM potential energy surfaces and are therefore residue-specific. Threonine uses torsional parameters fit to the serine torsional scan, and all other amino acid use torsional parameters fit to joint alanine/glycine energy scans. These backbone parameters are combined with the dipeptide sidechain torsional parameters to give the full QUBE protein force field parameter set.   In order to provide a consistent and computationally efficient approach to assigning the non-bonded parameters, we recommend minimizing the experimental structure using a standard transferable force field in explicit water prior to the DFT calculation. In this study, we used the OPLS-AA/M force field for the initial minimization. Following non-bonded parameter assignment, bond, angle and torsion parameters were assigned as described in Section 2 based on the OPLS-AA/M atom types. For torsion and improper types not re-parameterized in this study, OPLS-AA/M parameters are retained. 1,7 Table 1 summarizes the number of bespoke non-bonded parameters for each protein studied, along with the bonded parameters that are parametrized using the dipeptide molecules as described above. All parameters, including atom-specific non-bonded parameters, are written to a CHARMM-style parameter file. The psf, pdb and inp files are provided in the Supporting Information. We note that preparation of the parameter files is fully automated, and scripts and step-by-step tutorials are available from https://github.com/colegroup/QUBEMAKER. MD simulations were performed using the NAMD software, using input parameters detailed elsewhere (Section S2.1). 1 Statistics were collected over a period of 200 ns for dipeptides and Ala 5 and Gly 3 , and 0.5 µs for the proteins. All MD simulations were performed in triplicate.

Simulation Analysis
All backbone and sidechain dihedral angles sampled during the dipeptide simulations were analyzed and compared with experimental data (Section S2.4). The simulated J coupling was calculated using the Karplus equation: where φ is the relevant dihedral angle and A,B,C are the Karplus parameters which can be derived from experimental measurements or QM calculations. 48 For the alanine pentapeptide, the J coupling error is calculated as: where < J j > sim is the time-averaged simulated J coupling, J j,exp is the experimental J coupling, σ is the estimated systematic error 48  glutamic acid scans, and the glutamine χ 2 scan. For glutamic acid, the error is also high for the OPLS-AA/M force field parameters, but the rotamer populations remained close to the experimental data, and this may be due to a problem with the functional form used in classical force fields. 1 The OPLS-AA/M error in the potential energy scan for glutamine is roughly half that of the QUBE force field. However, as we will show, the accuracy of the glutamine dipeptide MD simulations is good, and so no further refinement was made to the sidechain torsional parameters in this work.
Although a low error in the reproduction of the QM potential energy surface is clearly the desired result, this does not necessarily correspond to accurate non-bonded force field parameters. The degree to which torsional parameters can improve the fit between MM and QM scans depends not only on the accuracy of the non-bonded, and bond and angle, parameters, but also on the shape of the energy difference between the QM and MM scans.
The functional form used in classical MM force fields is very restrictive. However, the energy difference between the QM and MM energy scans must be corrected by the functional form for low errors to be achieved. Therefore, although we use errors in potential energy scans as a guide to performance, they cannot be relied upon as a measure of the accuracy of a force field. Therefore, we now investigate the performance of the QUBE force field in MD simulations.

Dipeptide Simulations
Extensive MD simulations were performed for each of the dipeptides. Computed NMR J couplings provide a quantative measure of the accuracy of the backbone φ torsion angle distribution sampled during the MD simulations. The full results are shown in Table S8.  for other force fields. 55 However, since the occupancy of PPII and β regions always remain higher than the the populations of the left-handed α-helical region, reducing the left-handed helical population was not considered a priority in this work. The right-handed α-helical populations are small for all the dipeptides. This is in agreement with experimental results. 50 Figure 3: a) The φ/ψ distribution extracted from the dipeptide MD simulations, plotted in the form −log(p ψ,φ ) (where p ψ,φ is the probability of a region being occupied). The lighter regions correspond to low probability areas including conformations that are not sampled during the simulation. b) The major conformations observed in protein structures. 54 The right-handed α-helical region is labelled as δ and the left-handed α-helical region as δ .
As well as the backbone conformations sampled, the sidechain rotamer populations were also analyzed. In Figure 4, the simulated rotamer populations are compared to experimental data taken from protein coil libraries (the data are given in the SI of Ref. 1). Given that the experimental data are not specific to dipeptides, perfect agreement is not expected.
However, populations at extreme values would cause concern and a correlation between The rotamer data, which were used to construct Figure 4, are reproduced in Table S9.
With a MUE of 14%, QUBE performs better than both OPLS-AA and OPLS-AA/L, which have errors of 23% and 21% respectively. 1 The error is not as low as OPLS-AA/M, which has an error of 10%, however with further empirical changes to the torsional parameters the error could likely be further reduced. Examining individual dipeptide errors, protonated histidine and aspartic acid are found to have the highest errors. The protonated histidine experimental data includes all ionization states of histidine and therefore may not be accurate, which would explain the high error. The higher error in the simulated dynamics of the aspartic acid dipeptide is more problematic and, in future versions of the QUBE force field, further changes to these sidechain torsional parameters may be considered. Table 2: The J coupling error for the alanine pentapeptide simulation. The Karplus parameter sets are the same as those used previously. 1 The values shown in parentheses correspond to the J coupling errors excluding 2 J(N, C α ). In the simulations carried out in this work, as well as the work of Amber ff15ipq and OPLS-AA/M, the low β backbone populations present result in a high 2 J(N, C α ) error for the second and third set of Karplus parameters. 1,2 The pentapeptide conformation with the largest population is PPII with 62 ± 2 % of the simulation spent in this conformation (Table   S11). This is similar to the conformational propensity observed for OPLS-AA/M (53.5 ± 0.2 %). Both force fields also result in a low α-helical population, which is consistent with experimental data. 15 Figure 6: The ψ and φ distribution for the glycine tetrapeptide (all residues are included) using a) the QUBE force field and b) OPLS-AA/M, plotted in the form −log(p ψ,φ ) (where p ψ,φ is the probability of a region being occupied). The lighter regions correspond to low probability areas including conformations that are not sampled during the simulation.

Peptide Simulations
The problems associated with using the Karplus parameters for Gly 3 are discussed in Section S9 of the Supporting Information and elsewhere. 1,51,56 Therefore, we evaluate the backbone conformations of Gly 3 through its φ/ψ distribution alone. In Figure 6, the OPLS-AA/M backbone conformational distribution is compared to that obtained using the QUBE force field. A lower α-helical population is occupied by the QUBE force field, but otherwise both distributions are very similar.
The MD simulations presented here have demonstrated that the QUBE force field, and the parameterization methods used to create it, are sufficiently accurate to recreate conformational propensities of short, flexible peptides. The errors in the simulated dynamics of these molecules are comparable to OPLS-AA/M, and the φ/ψ distributions demonstrate that the major conformations observed in protein structures are populated. Issues with the transferability of torsional parameters have already been identified from the longer peptide simulations, and are solved by applying regularization. In the following subsection, the performance of QUBE for entire proteins is evaluated to demonstrate the feasibility of applying the methodology to macromolecules and to further understand the intricacies of fitting torsional parameters to a system-specific force field. Figure 7: A comparison of the QUBE and OPLS non-bonded parameters for ubiquitin. The regions circled in correspond to carbonyl carbon atoms, which are expected to be electron deficient and therefore require small A and B Lennard-Jones coefficients. 24 Blue and dashed black lines represent lines of best fit and y = x respectively. The use of system-specific non-bonded parameters for biomolecular force fields allows for long-ranged polarization effects to be included, which is expected to improve the accuracy of the force field, particularly for measurements such as protein-ligand binding affinity that are sensitive to the electrostatic potential at the protein surface. A comparison of the QUBE and OPLS non-bonded parameters for ubiquitin is shown in Figure 7. Figures for the other proteins tested follow similar trends. As we have described, QUBE non-bonded parameters are derived directly from the QM partitioned electron density, and so, each atom has a unique charge and set of Lennard-Jones coefficients which depend on its environment. In contrast, the OPLS parameters are read from a library of atom types. The QUBE and OPLS charges correlate well with no clear outliers. As has previously been observed, 24 the QUBE Lennard-Jones parameters show a far greater level of variation than OPLS (and most other force fields).

Protein Dynamics
One assumption employed in the use of system-specific charges for proteins (and small molecules) is that the derived parameter set is not too dependent on the molecular conformation. To investigate this assumption, the sensitivity of the non-bonded parameters, for the GB3 protein, to the choice of input structure is investigated in Section S6.2. Ten structures were extracted from a MD simulation employing the OPLS-AA/M force field, and QUBE non-bonded parameters were computed for each snapshot. The standard deviation of the charge distribution across the ensemble is just 0.02 e, supporting previous observations that the underlying DDEC atoms-in-molecule charges are relatively independent of conformation. 27,34 It is important to test whether these system-specific force field parameters translate into more accurate protein interactions and dynamics. In this regard, although the conformational preferences of the peptides tested in the previous section are promising, it is not known whether the torsional parameters will continue to be appropriate for use with proteins. As the non-bonded parameters vary with the system studied, the transferability of torsional parameters cannot be readily assumed. To assess this we begin by studying MD simulations of the proteins ubiquitin and GB3.
The J coupling errors for ubiquitin and GB3 are summarized in Table 3. With an overall RMSE of 1.54 Hz, the error using the QUBE force field for ubiquitin is higher than that of OPLS-AA/M, which has an RMSE of 1.12 Hz, but lower than OPLS-AA and OPLS-AA/L with errors of 1.84 Hz and 1.70 Hz respectively. 1 GB3 follows the same trend with an RMSE  The J coupling results suggest that whilst the transfer of torsional parameters from dipeptides to proteins may cause some issues, the QUBE force field remains more accurate than OPLS-AA and OPLS-AA/L. This is promising when we consider that OPLS has been in development for many years with multiple iterations and parameter adjustments performed.  Figure 8 shows the five proteins tested, with the residue labels indicating the main regions that deviated from the crystal structure during simulation. Figure 9 shows the average RMSD of the C α atoms of the five proteins from the experi- The φ/ψ distributions of GB3 ( Figure S8) and the RMSD per residue ( Figure S7) help us to further analyze the results. Residues which deviate from the experimental structure, and the results from the AMBER ff15ipq force field, also tend to have high J coupling errors.
For example, residues 8-21 have a large deviation from the experimental structure and this is reflected by high J coupling errors in this region. The backbone J coupling error, using the 2007 Karplus parameters, for residues 8-21 is 2.00 Hz which is almost double the total backbone error. This region corresponds to a β-sheet, which separates over the second half of the MD trajectory and contributes to the increased backbone RMSD. Aside from this region, the only other residues which show noticeable deviation from the crystal structure are Val39, Asp40, Gly41 and Thr55. However, the small deviations that are present in these four residues are also observed in simulations using the AMBER ff15ipq force field. 2 Deviations between the simulated ubiquitin dynamics and experiment tend to be confined to regions without clear secondary structures. Often this is of little concern since both experimental NMR measurements and simulations with the AMBER force field also indicate flexibility in these regions (e.g. residues 7-11 and 72-74). However, deviations from the crystal structure in the disordered region between residues 54-66 is more of a concern, and contributes to the high J coupling and rising RMSD of the protein backbone over the second half of the simulation.
In contrast, Figure 9 shows that both the villin headpiece (PDB: 2F4K) and BPTI (PDB: 5PTI) retain their experimental structures extremely well (average RMSD in the range 1-2Å). The three α-helices present in the villin headpiece are retained throughout the simulations, and the φ/ψ distributions are in excellent agreement with experiment and the AMBER force field ( Figure S10). Similarly, in BPTI, regions with helical or β-sheet structures retain their structure. Some small changes in structure are observed (for example, residues 10-12 in villin and 36-40 in BPTI), though these correspond to regions with no fixed secondary structure or a bend.
We have also included in Figure 9 the protein binase (PDB: 1BUJ). This is an interesting test case as the experimental NMR structual ensemble reveals a high degree of residue flexibility, with loops that adopt multiple conformations ( Figure S12), but it is also challenging.
The two α-helices present in 1BUJ, around residue 10 and residue 30, are generally well represented with the QUBE force field. However, in regions with no structure, a bend, or a turn significant deviations from experiment are observed and the RMSD reaches extremely high values. By way of comparison, the backbone RMSD using the AMBER ff15ipq force field was 3.4Å after 10µs of simulation, and after 0.5 µs was approximately 3Å. In the AMBER ff15ipq work, the high RMSD was attributed to variability in the loop regions, which had also been observed in experimental structures. 2 Closer examination of the φ/ψ distributions of each residue reveal the difficulty of capturing accurate conformational preferences. For example, the NMR ensemble for Lys38 shows the presence of both β-sheet and α-helical conformations. These are also observed in MD simulations using both AMBER ff15ipq and the QUBE force field, but the proportion of each conformation is different. The ensemble of Ser66 is not well represented with AMBER ff15ipq, but with the QUBE force field all structures in the ensemble are captured to some degree, although an additional α-helical conformation is also observed. Table 4 summarizes the performance of the QUBE protein force field across the peptide and protein datasets presented in this study, and compares it with analogous assessments of the OPLS force fields studied previously. 1 Overall, QUBE out-performs legacy OPLS-AA and OPLS-AA/L force fields, but is less accurate (with the exception of alanine pentapeptide simulations) than the latest OPLS-AA/M force field. This is an encouraging result for this first generation protein-specific force field, but also indicates that there is room for improvement of QUBE within the fixed functional form of the biomolecular force field. The next section summarizes our roadmap for future development and application of the QUBE force field.

Discussion and Conclusion
The assumption that biomolecular force fields must be parametrized against the experimental properties of small molecules has persisted since MM simulations began and remains in all force fields under widespread use. 6,57,58 In this work, we look to challenge this assumption by deriving system-specific non-bonded parameters, from linear-scaling QM simulations, for consistency with the QUBE small molecule force field. These non-bonded terms were used here alongside libraries of (non-bespoke) bond and angle parameters, derived using the modified Seminario method, 30 and newly reparametrized torsional terms.
We have shown here that using system-specific non-bonded force field parameters can result in accurate conformational preferences for short peptides. Rotamer populations and simulated J couplings for the dipeptide molecules are in good agreement with experimental data and compare favorably with the latest OPLS force field. For longer peptide molecules, the problems associated with fitting torsional parameters to a system-specific force field became more apparent. Using regularization in the fitting process was shown to overcome these issues and resulted in a J coupling error of just 0.90 ± 0.03 for the alanine pentapeptide. Further work investigating disordered peptides will ascertain how general this fix is.
The accuracy of the peptide simulations supports the use of our non-bonded and modified Seminario bonded parametrization strategies. In protein MD simulations, the RMSD of the backbone atoms relative to experimental structures remained low, below 2Å, for two of the five proteins tested. The α-helices present in all of the proteins generally remained close to the experimental structures, but the β-sheets exhibited greater loss of structure, and regions with no clear structure or exhibiting a turn regularly deviated from the starting structure.
These regions also contributed greatest to J coupling errors. Despite this, the majority of the regions in the proteins retained their experimental structure and the J coupling errors for GB3 and ubiquitin were below those of OPLS-AA and OPLS-AA/L, two force fields regularly used in biomolecular modelling studies.
Whilst developing QUBE, manual adjustments to some torsional parameters were required. This was also required in the development of OPLS-AA/M, and we can infer from this that automatically fitting backbone and sidechain torsional parameters using dihedral energy scans is still challenging. The most obvious failure of dihedral energy scan fitting was that for a number of sidechains it was more accurate to set the torsional parameters to zero than to use the originally derived terms. This is in part due to the functional form used, with potential improvements to this discussed below, but is also due to the poor sampling of relevant structures by scanning one or two dihedral angles at a time. This problem is reduced by the iterative fitting methods used in AMBER ff14ipq and ff15ipq, 2,38 which sample the structures used for torsional parameter fitting by performing MD simulations with the current iteration of the force field. This approach will be considered in future versions of the force field.
Another potential source of error is the choice of modified Seminario method for derivation of bond and angle force constants. However, this method has been shown to accurately reproduce QM vibrational frequencies, 30 and importantly also reproduce QM intramolecular potential energy surfaces of drug-like molecules when combined with the QUBE non-bonded and torsion parameters. 25 There are also additional considerations involved in using a library of torsional parameters alongside system-specific non-bonded parameters. The torsional parameters are fit using one set of non-bonded parameters but are then used for a range of environment-dependent nonbonded terms. This is likely the reason for the importance of regularizations in this study.
During the sidechain torsional parameter fitting process it was observed that the optimal torsional parameters for α-helical and β-sheet backbone conformations can vary greatly.
It may be possible to address this issue by changing the functional form of the torsional component of the force field. The functional form currently used is inaccurate due to the parameter dependency on only a single dihedral angle. The coupling between torsional terms has been addressed in a number of different ways. [59][60][61][62] These include the use of the CMAP term in CHARMM22, a grid based correction used to improve the backbone torsional energetics. 61 Extending the CMAP correction so it is dependent on the χ 1 sidechain dihedral angle has also recently been investigated. 62 The functional form could also be improved by adding a torsion-torsion coupling term as employed in previous studies. 60 Importantly, the flexibility of the QUBE parametrization process means that changes to the torsional parameters are not the only alterations that could be made to improve the accuracy of the benchmark validation tests studied here. It is not just protein-protein interactions that determine structure, but also interactions with the water model. In particular, the balance between electrostatic and dispersive interactions has been shown to be crucial. 63,64 Interactions between the QUBE force field and the TIP3P water model may be responsible for some of the instabilities in structure that we have observed, and development of a QUBE water model may lead to improved dynamics and computed free energies of hydration. 25 An advantage of using a parametrization scheme that depends almost entirely on QM data is that alterations to the parametrization strategy or functional form can be readily inserted into the existing workflow. There is future scope for improvement in the choice of exchange-correlation functional used to derive non-bonded parameters, for example through the use of hybrid functionals, 65 or the choice of electron density partitioning methods, 66 or the addition of off-center charges to model electron anisotropy effects. 25 More fundamentally, we have the opportunity to investigate improvements to the functional form of the force field itself, for example, by adding higher order dispersion terms beyond the dipole-dipole interaction 37,66,67 or by altering the short-range repulsion term. 37 A future QUBE polarizable force field is also envisaged and, towards this goal, the derivation of accurate atoms-in-molecule atomic polarizabilities is under investigation. 68 As presented in the Results section, the general picture that emerges is that this first generation quantum mechanical bespoke force field is an improvement over legacy OPLS-AA and OPLS-AA/L force fields, but is out-performed by the most recent OPLS-AA/M force field for simulated dynamics of folded proteins in their native state. While we have previously shown that DDEC charges are not too dependent on small conformation changes, 28 further investigation is needed to establish the utility of QUBE for protein folding simulations.
Hence, although we have outlined our roadmap to future improvements, a natural question is: where can the QUBE protein force field be used now (especially given the higher cost of parameterization compared to transferable force fields)? Importantly, it has been shown previously that the use of system-specific force field charges leads to improvements in binding energetics of small molecules, 8 and reproduction of the QM electrostatic potential for both small molecules 27 and proteins. 34 Therefore, although simulated protein backbone dynamics is an important test, we envisage the QUBE small molecule and protein force field being particularly important for the study of intermolecular interactions in the condensed phase.
Indeed, QUBE was originally developed to provide, by construction, a compatible protein and small molecule force field for computer-aided drug design, where an accurate surface electrostatic potential of the protein is crucial. In this regard, the absolute binding free energies between the L99A mutant of T4 lysozyme and six benzene analogs have been recently computed using QUBE with a mean unsigned error of 0.85 kcal/mol, which compares very favorably with OPLS-AA/M (1.26 kcal/mol). 69 Although further work is required to establish this accuracy across a significantly wider range of protein-ligand complexes, the promise of these initial biomolecular simulation results indicate a viable pathway toward improved protein dynamics and interactions using quantum mechanical bespoke force fields.