High Accuracy Semi-Empirical Quantum Models Based on a Reduced Training Set

There exists a great need for computationally efficient quantum simulation approaches that can achieve an accuracy similar to high-level theories while exhibiting a wide degree of transferability. In this regard, we have leveraged a machine-learned force field based on Chebyshev polynomials to determine Density Functional Tight Binding (DFTB) models for organic materials. The benefit of our approach is two-fold: (1) many-body interactions can be corrected for in a systematic and rapidly tunable process, and (2) high-level quantum accuracy for a broad range of compounds can be achieved with ∼0.3% of the data required for one advanced deep learning potential (ANI-1x). In addition, the total number of data points in our training set is less than one half of that used for a recent DFTB-neural network model (trained on a separate dataset). Validation tests of our DFTB model against energy and vibrational data for gas-phase molecules from additional quantum datasets show strong agreement with reference data from hybrid density-functional theory, coupled-cluster calculations, or experiments. Preliminary testing on graphite and diamond successfully reproduces condensed-phase structures. The models developed in this work can, in principle, retain most of the accuracy of quantum-based methods at any level of theory with relatively small training sets. Our efforts can thus allow for high-throughput physical and chemical predictions with up to coupled-cluster accuracy for materials that are computationally intractable with standard approaches.

only a small fraction of the computational cost compared to DFT or other high-level quantum approaches. Here, the DFTB total energy is written as:

$$E_{\mathrm{DFTB}} = E_{\mathrm{BS}} + E_{\mathrm{Coul}} + E_{\mathrm{rep}}$$

where $E_{\mathrm{BS}}$ corresponds to the band structure energy, $E_{\mathrm{Coul}}$ is the charge fluctuation term, and $E_{\mathrm{rep}}$ is the repulsive energy. $E_{\mathrm{BS}}$ is calculated as a sum over occupied electronic states from the DFTB Hamiltonian. In practice, DFTB Hamiltonian matrix elements are computed from pre-tabulated Slater-Koster tables derived from reference calculations with a minimal basis set. The repulsive energy, $E_{\mathrm{rep}}$, corresponds to ion-ion repulsions, as well as Hartree and exchange-correlation double counting terms. This term can be expressed as an empirical function whose parameters are fit to reproduce high-level quantum or experimental reference data. A pairwise potential energy function is often used for the repulsive energy term, 8,9 though many-body interaction terms are required in some cases. 10,11 DFTB is approximately three orders of magnitude more efficient than DFT calculations and exhibits $O(N^3)$ scaling.
Its combination of approximate quantum mechanics with empirical functions can allow for a high degree of flexibility in terms of optimization approaches, desired accuracy, and transferability across element types and diverse conditions. 12-14 DFTB models have been created for a broad range of materials, though the repulsive energy has largely been tuned to relatively low-level DFT data for condensed phases. 15-19 […] and the Curvature Constrained Splines methodology 14 have been used to create strictly pairwise additive repulsive energies for several organic and inorganic systems. However, these methods can struggle for systems where greater than two-body interactions in $E_{\mathrm{rep}}$ are needed. 14 Neural networks (NNs) have been proposed as a promising method to include many-body interactions in the DFTB repulsive energy. 22 Here, we explore the possibility of creating DFTB models that leverage the relative simplicity of linear regression machine learning in the recently developed Chebyshev Interaction Model for Efficient Simulation (ChIMES) method. ChIMES is a many-body force field based on linear combinations of Chebyshev polynomials. 25 ChIMES models have been shown to yield good agreement with the DFT reference method for a wide range of properties and materials under both ambient and extreme conditions. 26-28 The main advantage of ChIMES is that it is completely linear in its fitted coefficients, allowing for rapid parameterization to a global minimum. The reliance on Chebyshev polynomials, which are orthogonal, allows the complexity of a ChIMES model to be systematically tuned to an arbitrary degree of accuracy and transferability, while also providing straightforward methods for regularization to minimize overfitting. 16 In this study, we determine an optimal DFTB/ChIMES model for C, H, N, O-containing systems using high-level quantum chemical reference data.
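To make the "linear in fitted coefficients" point concrete, the following is a minimal sketch, not the ChIMES code itself. It fits a one-dimensional toy function with Chebyshev polynomials of the first kind by solving the least-squares normal equations, so the coefficients come from a single linear solve rather than an iterative search. The function names (`cheb_basis`, `solve_linear`, `fit_linear`, `rms_error`), the toy target, and the uniform sample grid are all assumptions made for illustration; regularization is omitted.

```python
import math

def cheb_basis(x, order):
    """Return [T_0(x), ..., T_order(x)] using T_{n+1} = 2x T_n - T_{n-1}."""
    values = [1.0, x]
    for _ in range(2, order + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: order + 1]

def solve_linear(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - tail) / m[i][i]
    return c

def fit_linear(xs, ys, order):
    """Least-squares Chebyshev coefficients from A^T A c = A^T y."""
    rows = [cheb_basis(x, order) for x in xs]
    n = order + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve_linear(ata, aty)

def predict(x, coeffs):
    """Evaluate the fitted expansion at x."""
    basis = cheb_basis(x, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))

# Toy target on [-1, 1]; it stands in for a (reference - DFTB) residual.
xs = [-1.0 + 2.0 * i / 40 for i in range(41)]
ys = [math.exp(-2.0 * (x + 1.0)) for x in xs]

def rms_error(order):
    """RMS fitting error of the toy target for a given polynomial order."""
    coeffs = fit_linear(xs, ys, order)
    sq = sum((predict(x, coeffs) - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sq / len(xs))
```

Calling `rms_error` with increasing `order` shows the fit tightening as basis functions are added, which is the sense in which model complexity can be tuned systematically.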
We use an iterative scheme to systematically expand our training set: at each iteration, a small fraction of the force configurations with the largest deviation in our validation set are included in the next training set iteration. The accuracy and transferability of the resulting model are investigated for a wide variety of gas-phase clusters as well as some carbon solids. We find that use of a small fraction of our chosen data set (∼0.3% of similar NN efforts) yields DFTB/ChIMES models that maintain close to hybrid functional, coupled-cluster, and/or experimental accuracy for the gas-phase clusters studied here, and compare favorably to previous DFTB-NN efforts for similar systems.
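The iterative expansion described above can be sketched as a simple selection loop. This is an illustration under stated assumptions, not the procedure used in this work: `fit` and `force_error` are hypothetical user-supplied callables, and the default `fraction` and `n_iter` values are placeholders rather than the settings used here.

```python
def select_worst(force_errors, fraction):
    """Indices of the entries with the largest force error (at least one)."""
    n_pick = max(1, int(len(force_errors) * fraction))
    ranked = sorted(range(len(force_errors)),
                    key=lambda i: force_errors[i], reverse=True)
    return ranked[:n_pick]

def expand_training_set(train, pool, fit, force_error, fraction=0.05, n_iter=3):
    """Grow `train` by moving the worst-predicted `pool` entries into it.

    fit(train) -> model and force_error(model, config) -> float are
    hypothetical callables; nothing here is specific to DFTB or ChIMES.
    """
    train, pool = list(train), list(pool)
    for _ in range(n_iter):
        if not pool:
            break
        model = fit(train)
        errors = [force_error(model, cfg) for cfg in pool]
        worst = set(select_worst(errors, fraction))
        train += [cfg for i, cfg in enumerate(pool) if i in worst]
        pool = [cfg for i, cfg in enumerate(pool) if i not in worst]
    return train, pool

# Toy usage: "configurations" are numbers and the "model" is their mean.
def toy_fit(train):
    return sum(train) / len(train)

def toy_error(model, cfg):
    return abs(cfg - model)

new_train, new_pool = expand_training_set(
    [0.0, 1.0], [0.5, 2.0, 10.0, -4.0], toy_fit, toy_error,
    fraction=0.25, n_iter=2)
```

In the toy run, the outlier farthest from the current model is absorbed at each pass, so the training set grows only where the model is weakest.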
For our DFTB/ChIMES models, the total energy is determined as the sum of the standard DFTB energy with an additional ChIMES contribution:

$$E_{\mathrm{DFTB/ChIMES}} = E_{\mathrm{DFTB}} + E_{\mathrm{ChIMES}}$$

For this work, DFTB calculations are performed using the 3ob-3-1 parameter set, which contains a third-order expansion about the charges and is considered an optimal DFTB starting point for most organic systems. 22,29 The ChIMES energy is written as a many-body expansion:

$$E_{\mathrm{ChIMES}} = \sum_{i}^{n_a} E_i + \sum_{i<j}^{n_a} E_{ij} + \sum_{i<j<k}^{n_a} E_{ijk} + \cdots$$

where $n_a$ is the number of atoms in the system. The atomic energies $E_i$ are constants used to match energies from reference data, and two-body (pairwise) energies $E_{ij}$ are expressed as linear combinations of Chebyshev polynomials of the first kind. 30,31 Higher-bodied interactions are determined through products of a cluster's constituent pairwise polynomials. 32 ChIMES parameters are determined by fitting to the difference between the reference energies and atomic forces and those computed from DFTB alone using the following objective function: […] where $N_d$ is the total number of data entries, given by […] Here the number of gas-phase molecular conformations in the training set is given by $n_g$ and n […] The subscripts "ref", "DFTB", and "ChIMES" indicate the predicted quantity $X$ from the reference method, DFTB, and the present ChIMES correction, respectively. Further details about our DFTB calculations, the ChIMES functional form, the fitting procedure, and ChIMES hyper-parameter selection (including radial ranges, polynomial orders, and other pertinent details) can be found in the Supporting Information.
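As a rough illustration of how such a many-body expansion is evaluated, the sketch below sums atomic, pair, and triplet terms for a single-element toy system, with the triplet term built from products of pair polynomials. It is not the ChIMES functional form: the linear mapping of distances onto [-1, 1], the hard range check in place of a smooth cutoff, the absence of permutational symmetrization, and all names (`cheb`, `scaled`, `chimes_like_energy`, `r_min`, `r_max`) are simplifying assumptions.

```python
import itertools
import math

def cheb(n, x):
    """Chebyshev polynomial of the first kind T_n(x), by recurrence."""
    if n == 0:
        return 1.0
    t_prev, t_cur = 1.0, x
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def scaled(r, r_min, r_max):
    """Map r in [r_min, r_max] linearly onto [-1, 1] (assumed transform)."""
    return 2.0 * (r - r_min) / (r_max - r_min) - 1.0

def chimes_like_energy(coords, e_atom, c2, c3, r_min=0.5, r_max=4.0):
    """Toy expansion: sum_i E_i + sum_{i<j} E_ij + sum_{i<j<k} E_ijk.

    c2 maps a polynomial order n to its pair coefficient.
    c3 maps an order triple (a, b, c) to its triplet coefficient.
    Pairs or triplets with a distance outside [r_min, r_max] are skipped.
    A production model would also symmetrize c3 over atom permutations.
    """
    n_atoms = len(coords)
    energy = n_atoms * e_atom
    for i, j in itertools.combinations(range(n_atoms), 2):
        r = math.dist(coords[i], coords[j])
        if r_min <= r <= r_max:
            s = scaled(r, r_min, r_max)
            energy += sum(c * cheb(n, s) for n, c in c2.items())
    for i, j, k in itertools.combinations(range(n_atoms), 3):
        rs = (math.dist(coords[i], coords[j]),
              math.dist(coords[i], coords[k]),
              math.dist(coords[j], coords[k]))
        if all(r_min <= r <= r_max for r in rs):
            s = [scaled(r, r_min, r_max) for r in rs]
            energy += sum(c * cheb(a, s[0]) * cheb(b, s[1]) * cheb(d, s[2])
                          for (a, b, d), c in c3.items())
    return energy

# Example: a dimer with an atomic offset and two pair coefficients.
dimer = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
e_dimer = chimes_like_energy(dimer, e_atom=-1.0, c2={0: 0.5, 1: 2.0}, c3={})
```

Because every term is a coefficient times a fixed function of the geometry, the energy stays linear in `e_atom`, `c2`, and `c3`, which is what allows the fit to be posed as linear regression.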
The dataset used to develop the DFTB/ChIMES model here is a subset of the ANI-1x dataset, which we will refer to as 'sub_ANI-1x'. It contains only molecular conformations from ANI-1x computed using the CCSD(T)/CBS, ωB97X/def2-TZVPP, and ωB97X/6-31G* levels of theory. 33 This corresponds to ∼10% of the full ANI-1x data set and comprises 459,464 molecular conformations of 1895 unique molecules, including transition states of some chemical reactions. Since there are no data for atomic forces at the CCSD(T)/CBS level of theory in 'sub_ANI-1x', for our fitting purposes we use the ωB97X/def2-TZVPP reference data only. We note here that fitting a DFTB/ChIMES model using the whole 'sub_ANI-1x' set directly would utilize ∼19M data points, resulting in a slow parameterization. In addition, it is of great benefit to create semi-empirical quantum approaches that do not require the traditionally vast amounts of training data needed by most machine learning approaches. As a result, our first objective is to determine how much data is needed […] forces) are smaller than the variations between ωB97X-DFT itself and higher levels such as CCSD(T) and MP2 (4.9/5.9 kcal/mol for energies and 4.6/5.9 kcal/mol-Å for forces). 3 The performance of DFTB/ChIMES in comparison to coupled-cluster reference data is also provided in Table 1. Here, we have selected the ISO34 data set 35 […] and DFTB-NN$_{\mathrm{rep}}$ (DFTB-NN with deep tensor neural networks). 22 One can see that the accuracy of DFTB/ChIMES is much better than that of standard DFTB, is slightly improved over that of DFTB-NN$_{\mathrm{rep}}$, and approaches the PBE0 data given in Reference 22.
To test the performance of our model on high-accuracy force data specifically, we compare DFTB/ChIMES with the CCSD(T)/cc-pVTZ data for 2000 configurations of ethanol in the GDML data set 38 (54,000 data points total). Again, DFTB/ChIMES improves on standard DFTB, with both the MAE and the RMSE reduced by ∼40%. A direct force comparison to DFTB-NN$_{\mathrm{rep}}$ or the ISO34 reference was unavailable.
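For reference, the two error metrics quoted above can be computed as in the minimal sketch below, which assumes forces are flattened into one list of Cartesian components; the function name `mae_rmse` is ours. The 54,000 figure is consistent with counting force components only (2000 configurations, 9 atoms per ethanol molecule, 3 components per atom), which is an inference from the numbers rather than something stated explicitly.

```python
import math

def mae_rmse(predicted, reference):
    """Mean absolute error and root-mean-square error over flat lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - r for p, r in zip(predicted, reference)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Force components for 2000 ethanol (C2H5OH, 9 atoms) configurations,
# with 3 Cartesian components per atom.
n_force_components = 2000 * 9 * 3

# Tiny worked example with made-up numbers (not data from this work).
toy_mae, toy_rmse = mae_rmse([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```

The RMSE weights large deviations more heavily than the MAE, so reporting both gives a sense of whether the remaining error is spread evenly or dominated by a few outlying force components.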
To probe the smoothness of the potential energy surface from DFTB/ChIMES, we have also computed the potential energy profile for rotation around the dihedral angles in alkanes. The torsional profile for n-butane is shown in Figure 2.
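A torsional scan of this kind needs the dihedral angle defined by four consecutive atoms (for n-butane, the four carbons). The sketch below computes it with the standard atan2 formula from the three bond vectors. The function name `dihedral_degrees` and the test geometries are illustrative assumptions; evaluating the energy at each constrained angle would be left to whichever calculator is in use.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the p0-p1-p2 plane
    n2 = _cross(b2, b3)          # normal to the p1-p2-p3 plane
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Idealized planar test geometries (not optimized n-butane structures).
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
perpendicular = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1))
```

Using atan2 rather than an arccosine of the normals' dot product keeps the sign of the angle and avoids loss of precision near 0° and 180°, which matters when checking a profile for smoothness.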