Fragment-Based Calculations of Enzymatic Thermochemistry Require Dielectric Boundary Conditions

: Electronic structure calculations on enzymes require hundreds of atoms to obtain converged results, but fragment-based approximations offer a cost-effective solution. We present calculations on enzyme models containing 500 − 600 atoms using the many-body expansion, comparing to benchmarks in which the entire enzyme − substrate complex is described at the same level of density functional theory. When the amino acid fragments contain ionic side chains, the many-body expansion oscillates under vacuum boundary conditions but rapid convergence is restored using low-dielectric boundary conditions. This implies that full-system calculations in the gas phase are inappropriate benchmarks for assessing errors in fragment-based approximations. A three-body protocol retains sub-kilocalorie per mole fidelity with respect to a supersystem calculation, as does a two-body calculation combined with a full-system correction at a low-cost level of theory. These protocols pave the way for application of high-level quantum chemistry to large systems via rigorous, ab initio treatment of many-body polarization.

F ragment-based approximations 1−6 are an attractive way to circumvent nonlinear scaling of computational quantum chemistry (QC), whose floating point cost normally grows like N ( ) p as a function of system size (N), with exponents (p) ranging from 3 for density functional theory (DFT) to ≥7 for levels of theory that can provide thermochemical benchmarks.
Fragmentation into N F separate subsystems (fragments), each of size n, reduces that cost to N n ( ) p F × in a manner that is amenable to distributed computing and does not require modification to electronic structure codes.−8 The present work presents a protocol for using fragmentation to compute reaction energies and activation barriers for enzyme-catalyzed chemical reactions.Over the past decade, benchmark calculations have revealed that enzymatic thermochemistry does not converge until hundreds of atoms are included in the QC calculation, 9−17 which is much larger than is typical in contemporary quantum mechanics/molecular mechanics (QM/MM) calculations.Fragmentation may therefore offer an efficient route for obtaining converged thermochemistry (for N > 500 atoms) at benchmark levels of theory, provided that errors associated with fragmentation can be controlled.We demonstrate that these errors can be reduced below the "thermochemical accuracy" threshold of 1 kcal/mol.At the same time, our calculations reveal that straightforward comparison of fragment-based approximations to full-system benchmarks (as a means to assess error) is illposed if the calculations are carried out under vacuum boundary conditions.
We will consider models of enzyme−substrate complexes containing approximately 500−600 atoms.Total energies are approximated by means of a many-body expansion (MBE), Individual terms are for the two-body corrections, where E IJ is the energy of the dimer formed from fragments I and J, and for the three-body corrections.When eq 1 is truncated at nbody terms, we denote the resulting approximation as MBE(n).−36 Fragments I, J, K, ..., are taken to be individual amino acids of the enzyme (except where stipulated otherwise, for testing purposes), with the substrate as its own fragment.Although larger fragments have sometimes been used for proteins, 37 we are able to achieve our target accuracy of 1 kcal/mol using single-residue fragments.(The treatment of the substrate is discussed below.)−28,38−45 This can be motivated in terms of a generalized (G)MBE, 1,6,46,47 but most overlapping fragment applications to date have used a one-body approach that captures throughbond interactions but not through-space interactions. 6A twobody GMBE can capture both, but is relatively expensive in terms of the number of subsystems that are generated. 31,39As such, we stick to the simple MBE(n) approach in this work.
−16 A Mg 2+ ion in the active site is essential to COMT's function 56 but leads to charge-transfer effects in QC calculations that can significantly alter the barrier height, depending on the size of the model system. 14,52,55 − )], two aspartic acid residues, an asparagine residue, and a water molecule (58 atoms).Reactant, product, and transition state structures for the transfer of a methyl from S-adenosyl-Lmethionine (SAM) to catecholate were protonated and relaxed as described in Computational Details.All calculations were performed at DFT levels of theory, so that we may obtain energies for the full enzyme−substrate complex at the same level of theory and thereby examine convergence of the MBE toward a well-defined target.As such, the errors discussed below are defined with respect to a supersystem calculation at the same level of theory.
The overall charge on this QM model is −1, but the system contains nine fragments with non-zero charge.Small anions in the gas phase are sometimes inherently unstable (or metastable), as in the case of SO 4 2− , 57,58 and delocalization errors in DFT can exacerbate this problem. 58−61 This is not always a viable or realistic option, however, as charged side chains may be directly involved in stabilizing the protein structure or binding to a ligand (as in the present example), or may be vital to a reaction mechanism.A general procedure for enzymatic thermochemistry must admit the possibility of fragments with non-zero charge.
When we naively apply MBE(n) to a large COMT model with charged residues, however, we find that convergence is erratic.This is shown for the barrier height (activation energy E a ) in Figure 1a, where MBE(2) overestimates the barrier by 5.4 kcal/mol but MBE(3) underestimates it by 16.7 kcal/mol.To verify that charged residues are the problem, we prepared a second model of COMT in which nearest-neighbor fragments are combined to neutralize charge; e.g., a negatively charged aspartic acid residue is combined with a positively charged ligand, forming a single fragment.This increases the largest fragment size from 58 to 124 atoms but does not alter any protonation states.Using this "charge-coordinated" model of COMT, we observe rapid convergence of MBE(n), such that two-and three-body calculations afford essentially identical values for both E a and Δ rxn E (see Figure 1).Even one-body calculations perform reasonably well for the charge-coordinated model, due to the larger fragment size, but enlarging the fragments is not an attractive strategy for levels of theory beyond DFT.
Each of the calculations described above was performed using vacuum boundary conditions.As an alternative, we introduce low-dielectric boundary conditions using a polarizable continuum model (PCM). 62−74 The precise value of ε may matter for pK a calculations, but E a and Δ rxn E often converge quickly as a function of ε, such that results for ε = 4 are indistinguishable from much larger values. 75,76That is indeed the case for COMT, as shown in Figures S2 and S3.Gas-phase energetics (ε = 1) are an outlier, but ε = 4 is nearly indistinguishable from ε = 32.Subsequent PCM calculations in this work use ε = 4.
When the subsystem calculations required for MBE(n) are performed using PCM boundary conditions with ε = 4, and results compared to a supramolecular calculation with the same boundary conditions, we observe excellent convergence of MBE(n) even in the presence of single-residue fragments with net charge.The results for several other DFT functionals and basis sets are listed in Tables S1 and S2, and we note that stable results are obtained even when the basis set contains diffuse functions.−79 ) In Figure S4, we extend some of these results to n = 4 in order to check convergence.Using PCM boundaries, the difference between

The Journal of Physical Chemistry Letters
the MBE(3) and MBE(4) results is ≲1 kcal/mol, while gasphase calculations sometimes afford errors of >150 kcal/mol at the four-body level!MBE(3) calculations with low-dielectric boundary conditions consistently provide sub-kcal/mol fidelity for various functionals and basis sets, whereas MBE(3) with vacuum boundary conditions affords errors of 10−30 kcal/mol in many cases.
These results suggest that large errors for enzymatic thermochemistry obtained using MBE(n) with vacuum boundary conditions originate not from the fragmentation approximation, nor do they arise from the simple hydrogen atom caps that we use to saturate the severed valencies.(Our approach to capping is less sophisticated than the use of "conjugated caps" that attempt to replicate amino acid moieties, [25][26][27][28]80,81 yet our results demonstrate that sub-kcal/ mol accuracy can be achieved with hydrogen atom caps.)Instead, large and oscillatory errors in MBE(n) calculations under vacuum boundary conditions arise due to inconsistent charge delocalization in the n-body calculations. To obtai a polarization environment that is comparable to that of the supersystem, high-order n-body calculations are required, beyond n = 4.Alternatively, dielectric boundary conditions provide a simple and low-cost means of mimicking this polarization.In principle, one might consider the use of heterogeneous dielectric boundaries, 82−85 such that hydrophobic parts of the protein are treated differently from solventexposed portions.This has not been pursued in the present work, where we simply aim to demonstrate that convergence of the MBE in vacuo is not well-defined.
To confirm this explanation, we also examined a different enzymatic reaction that does not involve charged moieties near the active site.For this example, we chose the decarboxylation of L-apartate by the enzyme L-aspartate α-decarboxylase (AspDC), which has also been studied using QC models of varying size. 86Here, we consider only the C−C cleavage step, using a model consisting of 30 monomers (511 atoms), corresponding to a 5 Å radial cutoff around the active site of the relaxed crystal structure.This system has zero net charge but two ionic amino acids, which were placed together in a single fragment to avoid having any charged fragments.Results for E a and Δ rxn E (Figure 1) demonstrate that n-body results converge similarly for both vacuum boundary conditions (ε = 1) and PCM boundary conditions with ε = 4, although the PCM-based error is smaller at the n = 2 level.Unlike the charge-coordinated results for COMT, where the fragments are large and thus many-body effects are small, here the n = 1 results are unacceptable but two-body results with lowdielectric boundary conditions are rather good.
Together, these results demonstrate that application of MBE(n) with vacuum boundaries to large enzyme−substrate complexes need not converge to the supermolecular result at low orders (n ≤ 4).Oscillatory behavior results from inconsistent charge delocalization across the subsystems (monomer to dimer to trimer, etc.) when the fragments have net charge.Low-dielectric boundary conditions with ε ≈1.5 have been shown to reduce the density delocalization error in isolated peptide DFT calculations, 87 and in the present context the use of ε = 4 appears to prevent oscillatory changes to corrections ΔE IJ , ΔE IJK , etc.
Given a two-or three-body approximation for a large enzyme model with charged side chains, one might worry about neglect of long-range interactions.We address this by assessing a multilayer fragmentation scheme in which a low-level calculation on the entire system is used to correct for errors introduced by fragmentation, while the subsystems are described at a higher level of theory.This strategy has been suggested by others under various names, 88−90 and is illustrated in Figure 2 by analogy to the "ONIOM" approach to QM/MM calculations. 91Both the subsystems and the supersystem are computed at the lower level of theory, and the difference between low-level supersystem and low-level MBE(n) calculations provides a correction for the effects of fragmentation, including neglect of long-range polarization.Raghavachari and co-workers have made extensive use of this idea for calculations in biological macromolecules, 59−61,92−95 and our two-layer procedure is equivalent to the "MIM2" strategy defined in ref 89. 6e tested several low-level supersystem corrections in combination with various target levels of DFT, for the activation energy in COMT.Errors with respect to the target level (applied to the entire enzyme−substrate model) are illustrated in Figure 3, and numerical values for each supersystem correction are listed in Table S3.The low-level methods that we tested include the semiempirical thricecorrected methods HF-3c 96 and PBEh-3c, 97 which use a minimal and a double-ζ basis set, respectively.We also tested conventional Hartree−Fock (HF) theory and the density functional LRC-ωPBEh, 98,99 each with the 6-31G basis set.(Note that 6-31G is much less expensive than other double-ζ basis sets if the electronic structure software can take advantage of compound shells. 100) For this particular 632-atom enzyme− substrate complex, all four of these supersystem corrections require a similar computational effort, which constitutes <20% overhead on top of a MBE(2) calculation.
Even without the supersystem correction, results in Figure 3a indicate that a two-body expansion can achieve ∼1 kcal/mol accuracy for E a using various density functionals.Low-cost supersystem corrections decrease this to ∼0.5 kcal/mol.MBE( 3) is an order of magnitude more accurate than MBE(2), achieving ∼0.1 kcal/mol fidelity even without the supersystem correction.MBE(3) seems to represent something of an accuracy limit, as low-cost supersystem corrections no longer improve the results.
Importantly, the HF-3c supersystem correction performs just as well as HF/6-31G, despite using only a minimal basis set ("MINIX"). 96For the 632-atom enzyme−substrate .Multilayer techniques applied to the complex of an enzyme (depicted as a chain of amino acids) and a substrate labeled S. Colors encode the level of theory, with the higher-level method in orange and the lower-level method in blue.(a) Conventional ONIOM method, in which high-level calculations are applied only to the model system.(b) Multilayer fragmentation method, in which the high-level method is applied to the entire system by means of fragmentation.
The Journal of Physical Chemistry Letters complex used to model COMT, this means 1944 basis functions for HF-3c/MINIX versus 3510 functions for HF/6-31G.On a single 28-core node, these supersystem calculations can be completed in 0.6 h (HF-3c/MINIX) and 1.0 h (HF/6-31G), with 80−90% of that time spent in the PCM solver, which is less well-parallelized as compared to two-electron integrals.The PCM cost could be reduced by using a less dense surface discretization.
Having established that we can obtain converged results, we next turn to computational efficiency.The cost of fragmentation methods is not always discussed honestly and should be measured in aggregate computer time rather than wall time. 6,31,39,101Timing data for single-point energy calculations on COMT are provided in Figure 4, with the corresponding numerical data in Table S5.In the absence of any supersystem correction, MBE(2) with PCM boundary conditions is ∼60% of the cost of a supersystem calculation at the same level of theory (ωB97X-D/def2-SVP), whereas MBE( 3) is ∼14 times more expensive than the supersystem calculation, although this cost can be distributed across a large number of compute nodes.Despite using larger fragments, the charge-coordinated MBE(3) calculation is actually ∼10% cheaper than MBE(3) with single-residue fragments, because the former calculation decreases the number of unique subsystems from 7175 to 3581.This balance would likely shift in favor of the singleresidue calculation if a method more expensive than DFT were used.
The number of subsystems required for MBE(n) increases as N n for a protein with N residues, which imposes a severe computational bottleneck even for n = 3. 31 In what follows, we screen the dimers and trimers on the basis of distance, removing them from the calculation if the minimum interatomic distance between any two fragments exceeds a specified threshold, R cut .We then recompute E a and Δ rxn E for COMT, with the caveat that we are careful to ensure that the same residues are included in the reactant, product, and transition state models.Tests of a distance-screened MBE(3) approximation (Figure S5) demonstrate that the predicted value of E a changes by <0.1 kcal/mol as R cut is decreased from 25 to 8 Å. Setting R cut = 8 Å decreases the number of subsystems from 7175 to 1499 (as shown in Figure S6), with a negligible effect on accuracy.Using R cut = 8 Å, the computational effort for ωB97X-D/def2-SVP is reduced from 2025 h (the value shown in Figure 4) to 657 h.The latter  The Journal of Physical Chemistry Letters figure is still 5 times greater than the cost of the corresponding supersystem calculation, however.
We include diffuse functions in our next set of tests (Table 1), because a method that is intended for general application to enzymatic thermochemistry must be able to accommodate diffuse functions to describe anionic side chains, yet these functions often prove to be problematic for self-consistent charge schemes. 77,78Even if the electrostatic embedding charges are taken from a force field and held fixed, QM/ MM-style, the use of diffuse functions may lead to overpolarization of the QM system by the MM charges. 102rrors in E a for COMT, computed using MBE(2) and MBE(3) at the ωB97X-D/def2-SVPD level, are summarized in Table 1, which includes results using a HF/6-31G supersystem correction and/or distance-based screening using R cut = 8 Å.We have also tabulated errors with respect to a ωB97X-D/ def2-TZVP supersystem calculation, providing a measure of the basis set incompleteness error when the smaller def2-SVPD basis set is used.The two-and three-body approximations afford sub-kcal/mol errors with respect to supersystem results using the larger def2-TZVP basis set, suggesting that the basis set incompleteness error is <1 kcal/mol.MBE(3) provides converged results without the need for a supersystem correction, which scarcely alters the results, whereas such a correction affords a small but noticeable improvement to MBE (2).
Consistent, sub-kcal/mol fidelity can be achieved in two ways: MBE(3) alone or MBE(2) with a supersystem correction.Distance cutoffs with R cut = 8 Å can safely be applied in either case.This consistency indicates that the supersystem correction primarily accounts for three-body polarization, and that four-body terms make a negligible (sub-kcal/mol) contribution when PCM boundaries are applied (Figure S4).Of these two protocols, MBE(2) with cutoffs and a supersystem correction is more affordable, by a factor of 11 as compared to MBE(3) with cutoffs and no supersystem correction.Although the best measure of realworld cost is the total (aggregate) time across all processors, if one wants to use throughput as the figure of merit then it is worth noting that the 379 h required for the supersystemcorrected MBE(2) calculation corresponds to 329 distinct subsystems that can be distributed across compute nodes.These results are converged to within ≲1 kcal/mol of a ωB97X-D/def2-TZVP calculation that consists of 11 767 basis functions and requires an aggregate computational time of 17 546 h running on a single 40-core node.
In summary, we find that low-dielectric boundary conditions lead to rapid convergence of the MBE, which otherwise suffers from oscillatory behavior in the presence of charged fragments.Larger, charge-neutral fragments can be used as an alternative strategy to avoid these oscillations, but this will significantly increase the cost if a correlated wave function method is used to describe the subsystems.Because ionic residues must be anticipated in general, this makes dielectric boundary conditions effectively mandatory for QC calculations of enzymatic thermochemistry.These observations furthermore suggest that the use of gas-phase supersystem calculations to benchmark fragmentation approximations distorts the performance of those approximations.Where charged fragments are involved, comparison to a gas-phase calculation may exaggerate the role of higher-order n-body terms.
When dielectric boundaries are employed, MBE(3) provides converged results with sub-kcal/mol fidelity, using singleresidue fragments without the need for electrostatic embedding, conjugated caps, or an ONIOM-style supersystem correction.This relatively simple three-body approach represents a reliable fall-back procedure for systems that are too large for conventional DFT.A practical alternative is MBE(2) with distance screening, in a double-ζ basis set, plus an ONIOM-style supersystem correction at the HF/6-31G level.This composite approach is converged below 1 kcal/mol with respect to a triple-ζ benchmark and is considerably less expensive than the full-system DFT calculation that it aims to approximate, with a cost that can be readily distributed across hardware.
In the end, we find that enzymatic thermochemistry can be reproduced with sub-kcal/mol fidelity using practical protocols based on fragmentation.The stage is set to push the accuracy of these calculations beyond the DFT level, by means of a hybrid approach that deploys high-level methods for the twobody interactions combined with three-body DFT to capture polarization by the protein environment.We are also exploring the use of fragment-based vibrational frequency calculations, as pioneered by others, 103−105 to include zero-point corrections and finite-temperature thermal corrections.(The use of smooth cutoffs in gradient calculations has already been demonstrated. 7) Network analysis can be used to build sensible (if sizable) models of the enzyme−substrate complex, 55,106,107 and then the protocols developed here can provide converged results for any given model.Together, these developments promise to make QC modeling of enzymatic reactions more robust and systematic.

■ COMPUTATIONAL DETAILS
The crystal structure 51 of COMT [Protein Data Bank (PDB) entry 3BWM] was protonated using the H++ web server 108 at pH 7.0, a salinity of 0.15 M, ε in = 10, and ε out = 80.Ligand atoms were protonated separately using PyMOL and then validated against reactant and product structures taken from ref 14.As in that work, the inhibitor 3,5-dinitrocatechol in the crystal structure was replaced with catecholate (C 6 H 5 O 2 − ).Reactant and product structures were relaxed using the GFN2-xTB semiempirical model, 109 with a generalized Born/surface area (GBSA) implicit solvent model for water. 110Recent benchmarking demonstrates that GFN2-xTB with implicit solvent affords protein structures that compare favorably to experiment. 111To obtain the transition state, we scanned the length of the bond between the sulfur atom on SAM and the transferred methyl group.The system was then trimmed to obtain the 632-atom "model 8" from ref 14, which contains residues within a 5 Å radius of the active site along with three important residues identified experimentally.This model affords converged energetics with respect to larger models. 14or AspDC (PDB entry 1UHE), 112 a single monomer unit can be directly downloaded from the PDB, although the complete structure is an octamer.Starting from the latter, a large radial cutoff of 12 Å was used for structure relaxation with GFN2-xTB in implicit solvent.From that relaxed structure, a smaller 5 Å region was created for a scan along the bondbreaking coordinate.A transition state and a product structure were determined from that scan.For fragmentation calculations, the negatively charged ligand and the cationic arginine residue coordinated to it were included in a single fragment, such that all fragments are uncharged.
In creating fragments, we avoid cutting the polar C−N peptide bond (following previous recommendations), 39,59 and instead create fragments by cutting the C α −C carbonyl bond as indicated in Figure S1.The severed valence is capped with a hydrogen atom that is positioned according to eq S1, as in previous work. 39ll QM calculations were run using a home-built fragmentation code (PyFragme∩t), interfaced with Q-Chem. 113For all calculations, the self-consistent field convergence threshold is set to τ SCF = 10 −8 Ha and the integral and shell-pair drop tolerances are set to τ ints = 10 −12 a.u.−117 The continuum interface is defined by a van der Waals cavity, 62 constructed using modified Bondi atomic radii 118 that are scaled by a factor of 1.2.That surface is discretized using 110 Lebedev points for hydrogen and 194 points for other atoms. 114A conjugate gradient algorithm was used to solve the C-PCM equations for the full protein model, 117 whereas matrix inversion was used for the subsystem C-PCM calculations.Calculations with ωB97X-D and M06-2X +D3 use the SG-2 quadrature grid, 119 whereas SG-1 120 is used for other functionals.

Figure 1 .
Figure 1.Errors in MBE(n) calculations at the ωB97X-D/def2-SVP level as compared to a supersystem calculation at the same level of theory: (a) E a for COMT, (b) Δ rxn E for COMT, (c) E a for AspDC, and (d) Δ rxn E for AspDC.For COMT, results are shown for vacuum boundary conditions (ε = 1) vs PCM boundary conditions (ε = 4) and for a charge-coordinated model that uses larger, charge-neutral fragments.The AspDC system does not contain charged fragments.

Figure 2
Figure2.Multilayer techniques applied to the complex of an enzyme (depicted as a chain of amino acids) and a substrate labeled S. Colors encode the level of theory, with the higher-level method in orange and the lower-level method in blue.(a) Conventional ONIOM method, in which high-level calculations are applied only to the model system.(b) Multilayer fragmentation method, in which the high-level method is applied to the entire system by means of fragmentation.

Figure 3 .
Figure 3. Errors in E a for COMT, computed using two-layer fragmentation methods at the (a) MBE(2) or (b) MBE(3) level.Target levels of theory (used for the fragments) are indicated by the colored bars, and the error is assessed with respect to a supersystem calculation at that level of theory.Several low-cost supersystem corrections are evaluated, as indicated along the horizontal axis.All calculations use PCM boundary conditions with ε = 4, without any cutoffs for the MBE(n) calculations.

Figure 4 .
Figure 4. Aggregate computational time (on a logarithmic scale) for a single-point calculation on the 632-atom COMT enzyme−substrate complex, at the ωB97X-D/def2-SVP level of theory.The supersystem calculation contains 6042 basis functions and was performed on a single 28-core node (Dell Intel Xeon E5-2680 v4).Fragment calculations were performed on the same hardware with seven worker processes per node, each using four cores.
Kulik et al. 14 considered a sequence of COMT models with QM regions with up to 940 atoms, and we selected "model 8" from ref 14, which contains 632 atoms and 35 fragments.The largest fragment consists of the octahedral coordination sphere around Mg 2+ , including deprotonated catechol [2-hydroxyphenolate (C 6 H 5 O 2

Table 1 .
Errors in E a for COMT, for Calculations at the ωB97X-D/def2-SVPD Level a All calculations were performed using a PCM with ε = 4. b Error with respect to a supersystem calculation at the same level of theory.c Error with respect to a supersystem calculation at the ωB97X-D/def2-TZVP level of theory.d Aggregate computer time for one single-point energy calculation, using a single 48-core node (Intel Xeon Platinum 8268).Fragment calculations employ 12 worker processes, each running on four cores.e HF/6-31G as a supersystem correction.f Screening threshold R cut = 8 Å. a