Systematic QM Region Construction in QM/MM Calculations Based on Uncertainty Quantification

While QM/MM studies of enzymatic reactions are widely used in computational chemistry, the results of such studies are subject to numerous sources of uncertainty, and the effect of different choices by the simulation scientist that are required when setting up QM/MM calculations is often unclear. In particular, the selection of the QM region is crucial for obtaining accurate and reliable results. Simply including amino acids by their distance to the active site is mostly not sufficient as necessary residues are missing or unimportant residues are included without evidence. Here, we take a first step towards quantifying uncertainties in QM/MM calculations by assessing the sensitivity of QM/MM reaction energies with respect to variations of the MM point charges. We show that such a point charge variation analysis (PCVA) can be employed to judge the accuracy of QM/MM reaction energies obtained with a selected QM region, and devise a protocol to systematically construct QM regions that minimize this uncertainty. We apply such a PCVA to the example of catechol O-methyltransferase, and demonstrate that it provides a simple and reliable approach for the construction of the QM region. Our PCVA-based scheme is computationally efficient and requires only calculations for a system with a minimal QM region. Our work highlights the promise of applying methods of uncertainty quantification in computational chemistry. Table of

is crucial for obtaining accurate and reliable results. Simply including amino acids by their distance to the active site is mostly not sufficient as necessary residues are missing or unimportant residues are included without evidence. Here, we take a first step towards quantifying uncertainties in QM/MM calculations by assessing the sensitivity of QM/MM reaction energies with respect to variations of the MM point charges. We show that such a point charge variation analysis (PCVA) can be employed to judge the accuracy of QM/MM reaction energies obtained with a selected QM region, and devise a protocol to systematically construct QM regions that minimize this uncertainty. We apply such a PCVA to the example of catechol O-methyltransferase, and demonstrate that it provides a simple and reliable approach for the construction of the QM region.
Our PCVA-based scheme is computationally efficient and requires only calculations for a system with a minimal QM region. Our work highlights the promise of applying methods of uncertainty quantification in computational chemistry.

Introduction
Since the introduction of the QM/MM approach by Warshel and Levitt in 1976 1 it has evolved to a broadly used tool in computational chemistry. Dividing a large biomolecular system into two subsystems, one smaller subsystem treated quantum-mechanically (QM) and one larger subsystem treated with molecular mechanics (MM), allows one to investigate the mechanisms and energetics of enzymatic reactions in an efficient way. [2][3][4] Consequently, the application of the QM/MM approach to large biomolecules such as enzymes has become a common practice in the last two decades. [5][6][7][8][9][10][11] However, in practice setting up QM/MM calculations is far from trivial and requires many manual choices by the simulation scientist. 4,8 Besides the selection of a suitable QM method and an MM force field, the most important decision is the choice of the QM region.
The convergence of QM/MM results with the choice and particularly the size of the QM region has been investigated in several studies, [12][13][14][15][16] which underline that it generally has a large effect on the quality of the final results, and that often rather large QM regions are required for reaching converged results. To alleviate this problem, schemes for the systematic construction of QM regions have been proposed, usually with the aim of obtaining medium-sized QM regions that provide reliable QM/MM reaction energies. Examples of such schemes include free energy perturbation analysis, 17 charge deletion analysis, 18 charge shift analysis (CSA), 15 and Fukui shift analysis (FSA). 19 For their recently developed selfparametrizing system-focused atomistic models (SFAM), Brunken and Reiher proposed an automatic scheme for the construction of hybrid QM/SFAM models, including a systematic determination of the QM region based on the energy gradient. 20 For a given QM region, there are different possible choices of the embedding method, 21 of a suitable coupling scheme, 22 and for the treatment of the boundary region 23 if covalent bonds cross the border between the subsystems. For the commonly used electrostatic embedding scheme, 4 the choice of the MM point charges will influence the resulting QM and QM/MM energies, and further parameters need to be chosen in advanced polarizable embedding 24,25 or flexible embedding schemes. 26,27 Different coupling schemes are available such as IMOMM, 28 ONIOM, 29 or the AddRemove 30 model. The most common approach for treating covalent bonds across the QM-MM boundary is saturating the QM region with capping atoms. 4 Here, the position and the type of these atoms is crucial for the calculations. An alternative are frozen orbitals, 31 e.g., in the Localized SCF (LSCF) 32 or the Generalized Hybrid Orbital (GHO) 33 approach. The uncertainties introduced by all these different choices and the corresponding parameters are interconnected and will again depend on the choice of the QM region.
Consequently, there is a need for rigorous uncertainty quantification for QM/MM calculations, i.e., to systematically assess the sensitivity of the QM/MM energy with respect to these technical choices and empirical parameters and to ultimately provide rigorous error bounds on the results of QM/MM calculations (compared to a full QM treatment). Mathematical and computational tools for quantifying uncertainties in computer simulations have been developed intensively in the past decades (for textbooks, see, e.g. Refs. 34,35) and are employed in many areas of simulation science, 36,37 but their application is just starting to emerge in computational chemistry. 38 Recently, we have applied such tools for analyzing the sensitivity of calculated spectra with respect to distortions of the molecular structure. 39,40 Here, we aim at taking a first step towards uncertainty quantification for QM/MM methods by analyzing the sensitivity of QM/MM reaction energies with respect to variations of the MM point charges. While there are other relevant empirical parameters entering in QM/MM calculations, most importantly those related to the treatment of the QM-MM boundary (e.g., to the placement of the link atoms), we expect the MM charges to be a key factor influencing the final QM/MM results.
The ability to quantify the sensitivity of QM/MM calculations with respect to parameters of the MM environment provides a natural starting point for guiding the systematic choice of the QM region. With increasing size of the QM region and approaching a full QM calculation, one can expect this sensitivity to decrease. Therefore, choosing the QM region such that the uncertainty is reduced implies that the QM/MM calculation approach those of a full QM calculation. Here, we exploit this idea by proposing a simple and efficient scheme for the systematic construction of the QM region that is guided by uncertainty quantification.
This work is organized as follows. In Section 2 we recall the necessary theoretical background of QM/MM approaches (Sect. 2.1, introduce our point charge variation analysis (PCVA) for analyzing the sensitivity of QM/MM energies (Sect. 2.2, and give the computational details (Section 2.3). In Section 3, we introduce the model system used in this work and discuss the convergence of ligand charges and reaction energies for QM regions of increasing size. The sensitivity of these quantities with respect to global point charge variations is analyzed in Section 4. This is followed in by the evaluation of the energy sensitivity for single amino acids in Section 5, which are used to devise a scheme for the systematic construction of QM regions based on a PCVA. The QM regions obtained with this PCVAbased scheme are assessed for QM regions of increasing size and for atom-economical QM regions in Sections 6 and 7, respectively. Finally, conclusions and an outlook can be found in Section 8.
The exact definition of these three energy contributions varies between different implementations of QM/MM schemes. 4,8 In the following, we focus on the general principles as far as where N B is the number of atoms in subsystem B and q I,B represents the MM point charge and R i,B the position of the I-th atom. This electrostatic interaction energy is usually included in the QM energy of subsystem A, i.e., Altogether, the QM/MM energy can be expressed as It thus consists of the embedded QM energy of subsystem A, the MM energy of subsystem B, and the non-electrostatic interactions between the two subsystems.
When investigating enzymatic reactions with QM/MM calculations, the main quantity of interest (QoI) is generally the reaction energy,

Sensitivity analysis for QM/MM energies
The reaction energy calculated within a QM/MM model is subject to numerous sources of uncertainty (see Introduction). One important element of uncertainty quantification 34,35 is the analysis of the sensitivity of the simulation results with respect to its input parameters. 41 Here, we consider the QM/MM reaction energy ∆E reaction QM/MM as our quantity of interest (QoI) and analyze how sensitively it depends on parameters of the QM/MM model.
We consider the MM point charges q MM as one of the most important sources of uncertainty and want to systematically analyze the effect of variations in the MM point charges on our QoI, i.e., the reaction energy ∆E reaction QM/MM (q MM ). To this end, we follow our earlier work on the sensitivity of calculated spectra with respect to distortions of the molecular structure 39 and consider a collective variation of the MM point charges, i.e., where q MM is a vector of size N B containing all MM point charges, q 0 MM is the vector of the undistorted MM point charges as provided by the employed force field, and ∆q MM is a collective variation of these point charges, which depends on a parameter ∆q that controls the size of the variation. We chose the collective variations of the MM point charges such that I ∆q MM,I = 0, i.e., the sum of the MM point charges is preserved.
In the following, we will consider two types of collective point-charge variations. First, we change the charges of all protein MM atoms simultaneously by an equal magnitude ∆q, while changing all solvent MM charges equally such that the total charge is preserved, i.e, where N protein where N aa,i is the number of atoms in the i-th amino acid. The first will provide an estimate of the overall sensitivity of the QM/MM reaction energy to variations of the MM point charges, while the second will allow us to assess the effect of the individual single amino acids.
For these collective point-charge variations, we perform a local sensitivity analysis 41 and consider the derivative of ∆E with respect to a the parameter ∆q, that is, the derivative is taken in the direction of the collective point-charge variation.
Of the components of the QM/MM energy [see Eq. (4)], the non-electrostatic interaction energy does not depend on the MM point charges. Therefore, the sensitivity of the QM/MM energy of the reactants or the products is given by, and the sensitivity of the reaction energy can be calculated as where the superscripts R and P designate the reactants and products, respectively. If the protein environment is similar for the reactants and the products, which is usually the case for enzymatic reactions, the second term can be expected to be small and could possibly neglected, i.e, The simplest way of obtaining the derivatives necessary for the calculation of the sensitivity δ∆E reaction QM/MM is their numerical evaluation, either using a symmetric two-point finite difference formula, (13) or using a forward two-point finite difference formula, An analytical evaluation of these derivatives is also possible, but would require modifications of the quantum-chemical software packages used for subsystem A.  Table S1 in the Supporting Information). These QM   The absence of only one of these three residues causes the charge of +0.55.

Computational Details
Overall, it can be stated that small QM regions are not sufficient for reproducing the ligand charges found for large QM regions because important residues coordinating the ligands might be missing. When using a distance-based construction of the QM region, rather larger QM regions need to be reached before all relevant residues are included in the QM region.
Finally, the QM/MM reaction energy for QM regions of increasing size is shown in Fig. 2C. Overall, our results confirm the slow convergence of the reaction energy with increasing size of the QM region found in earlier studies and emphasize the need for systematic protocols for the construction and selection of the QM region. We note that Jindal and Warshel 14 found that in COMT, the activation barrier shows a much smaller sensitivity with respect to the choice of the QM region than the reaction energy. Therefore, we will not consider activation barriers and focus on the reaction energies in the present work.    Fig. S3). Using the product or reactant energy sensitivities could also be advantageous for avoiding error cancelation that might be present when considering the sensitivity of the reaction energy only. Consequently, we choose to use only the reactant energies, which reduces the number of calculations to be performed by half. The impact for this approximation on the selection of the QM region will be discussed below. Finally, instead of using a symmetric two-point formula for the numerical differentiation, it turns out to be sufficient to use a forward finite-difference formula (PCVA-E), which again reduces the number of QM calculation by another factor of two.
Altogether, we employ the PCVA-E approximation and calculate the sensitivity with respect to point-charge variations for the i-th amino acid as, Naively, it could be expected that residues closer to the active site show higher sensitivities to point charge variations than distant ones. However, there are also several high-sensitivity amino acids at medium distances to the active site and low-sensitivity residues very close to the substrates. This observation confirms that an exclusively distance-based approach to include residues into the QM region is not able to detect all important amino acids and furthermore includes residues which are probably not necessary to obtain consistent QM regions. To find a compromise between including amino acids that show a high sensitivity and those that are close to the active site, we define an empirical QM region indicator Θ i for each amino acid by dividing the sensitivity by the COM distance between the amino acid and the active site, i.e., This definition ensures that distant residues with high sensitivities are considered, but also residues close to the active site which may show medium sensitivities are not overlooked.
The resulting indicator is plotted in Fig. 4B. A comparison of the indicators for the schemes PCVA-A to PCVA-D is shown in the Supporting Information (Fig. S4), and an additional 6 Assessment of PCVA-Based QM Regions of Increasing

Size
As a first test, we assess the PCVA-based construction of QM regions with increasing size.
Ideally, by using PCVA for a systematic construction of QM regions, the convergence towards the results obtained with very large QM regions should be accelerated compared to an exclusively distance-based construction of the QM region. To this end, we construct QM regions consisting of just as many residues as in the distance-based approach (see Section 4) and label these regions as 2', 3' etc. (e.g., QM regions 3 and 3' both contain seven amino acid residues). Fig. 5 and Tab. S4 in the Supporting Information show which residues are included for each QM region. Again, geometry optimizations of the reactant and product structures were performed for each of these QM regions.  for the distance-based construction of QM regions). The VDD charges (see Fig. 6A) converge to similar values as for the distance-based construction of the QM region. Remarkably, the Mg 2+ charge converges to about +0.3 already for region 3' and larger, which was achieved for the distance-based inclusion of residues only with region 8. The SAM charges behave similarly to the distance-based case. From region 3' onwards, it stabilizes at around zero, before reaching +0.25 in region 7' and larger. The CAT charges again starts converging for region 4' and larger.
PCVA-constructed QM regions deliver overall lower sensitivities regarding VDD ligand charges (see Fig. 6B), which marks an expected behavior because high-sensitivity residues are included in the QM region by PCVA. The convergence of the charge sensitivities is similar to the analysis for distance-based QM regions (cf. Fig. 2) with a fast convergence for Mg 2+ , a constant behavior for CAT, and jumps for SAM with a slightly decreasing trend.
The reaction energy (see Fig. 6C) decreases for the smallest QM regions, starts to stabilize for QM regions 4', and oscillates around our best estimate of about −11 kcal/mol for larger QM regions. However, these oscillations are smaller than for the distance-based construction of the QM regions. Note that compared to Fig. 2, we updated our best estimate to correspond to QM region 9' in Fig. 6C.
In contrast to the distance-based case, a slightly increasing trend is observed for the reaction energy sensitivity (Fig. 6D). For QM region 8', at which the reaction energy is further from our best estimate, the sensitivity also shows a marked increase. which also each include 16 amino acids. Table 1 compares the amino acid residues included in such an atom-economical QM region for a distance-based QM region construction, the CSA and FSA approaches, and for our PCVA-based QM-region construction.
Compared to the CSA and FSA approaches applied by Kulik et al. 15,19 our PCVA approach detects the majority of the residues that should be part of the 16-residue QM region, as it can be seen in Table 1 (e.g., VAL41, GLU89, or ASP140). Moreover, also more distant residues are included that would not be considered in a 16-residue QM region in the distance-based case (e.g., MET39, LYS45, ALA66, or ASP168). This indicates the ability of PCVA to even detect high-impact residues located relatively distant from the active site COM. Nevertheless, there are several residues included by CSA and/or FSA that are not under the 16 highest-ranked residues when applying the PCVA approach. However, most of these residues are assigned a PCVA rank close to 16 such as ASN40, GLU63, or ILE90.
An extreme case with a very low PCVA rank (178) compared to CSA (10) and FSA (10)(11) is SER71. This residue is located very close to the active site and thus potentially important for ligand binding, but it shows a very low electrostatic effect on the substrates and is thus not detected by PCVA. Another case is SER118, which is ranked about 30 places lower in PCVA than in CSA and FSA. Both these cases concern serine residues, which might show only a small electrostatic effect.
The shown differences between PCVA and CSA or FSA do not indicate that PCVA is performing worse as none of the existent methods can be considered the gold standard.
Furthermore, even between CSA and FSA major differences in the evaluation of residues can be observed (e.g. for GLU63, GLY65, TYR70, or ALA72) leading to differently composed QM regions.
In Fig. 6, we compare the VDD charges of SAM, CAT, and the catalytically-active Mg 2+ cation as well as the QM/MM reaction energy obtained for the atom-economical QM regions constructed using CSA, FSA, and PCVA to those obtained for PCVA-constructed QM regions of increasing size discussed in Section 6. Note that the size of the atom-economical QM regions of 16 amino acids is in between those of QM regions 4' (13 amino acids) and 5' (19 amino acids).
Regarding VDD charges (see Fig. 6A  For the considered test case, our scheme leads to a faster and more reliable convergence with the size of the QM region compared to distance-based QM region construction. Comparing to atom-economical QM regions of the same size provided by the other common approaches, in particular CSA and FSA, our PCVA-based approach performed well and yields similar QM regions. The huge advantage of PCVA is its much lower computational cost compared to the CSA or FSA approach (see Supporting Information, Tab. S5). Our PCVA-based approach requires only a geometry optimization of the target system with a minimal QM region including substrates, which is followed by single-point calculations for the point-charge variation of each amino acid. In contrast, CSA is a very expensive approach based on large QM regions with up to 1000 atoms. For these large systems, geometry optimizations have to be performed for the holo and apo enzyme structure for several snapshots along the reaction coordinate. 15 FSA, even though being much cheaper than CSA, still needs as many geometry optimization as there are amino acids in the system for the minimal QM region plus one additional residue in each calculation. 19 A rather similar approach to our PCVA-based construction of the QM region, the charge deletion analysis (CDA), is mostly reported for the usage with mediumsized QM regions, 18,[65][66][67][68][69][70] which also increases the required computational effort. Of course, our PCVA-based approach can also be applied for larger QM regions, but we found that this does not improve the results substantially.
The PCVA-based construction of the QM region is limited to the electrostatic effect of the amino acids and thereby lacking other properties which may also play an important role in QM region determination. Therefore, it is possible that crucial residues (e.g. catalytically important) may be absent under the detected residues. Here, the biochemical and structural understanding should be considered as well when constructing QM regions. Altogether, we suppose that our fast and computationally cheap approach is a good complement to existing methods for the automatic and systematic QM region construction. We expect that future developments concerning uncertainty quantification for QM/MM calculation will also allow for the development of more sophisticated schemes for the systematic construction of QM regions.

Supporting Information
Additional tables with details about the composition of distance-based, PCVA-based and 16residue QM regions, an extended version of Table 1, and a comparison of the computational effort. Additional figures with information about the HOMO-LUMO gap, the SAM-CAT distance convergence, and the comparison of sensitivities and indicators for the different approximate PCVA schemes.