Evaluation of Thermochemical Machine Learning for Potential Energy Curves and Geometry Optimization

Abstract

While many machine learning methods, particularly deep neural networks, have been trained for density functional and quantum chemical energies and properties, the vast majority of these methods focus on single-point energies. In principle, such ML methods, once trained, offer thermochemical accuracy on par with density functional and wave function methods but at speeds comparable to traditional force fields or approximate semiempirical methods. So far, most efforts have focused on optimized equilibrium single-point energies and properties. In this work, we evaluate the accuracy of several leading ML methods across a range of bond potential energy curves and torsional potentials. Methods were trained on the existing ANI-1 training set, calculated using ωB97X/6-31G(d) single points at non-equilibrium geometries. We find that across a range of small molecules, several methods offer both qualitative accuracy (e.g., correct minima, both repulsive and attractive bond regions, anharmonic shape, and single minima) and quantitative accuracy in terms of the mean absolute percent error near the minima. At the moment, ANI-2x, FCHL, and our new grid-based convolutional neural net show good performance.


Introduction
Machine learning (ML) methods have been proposed as surrogates for time-consuming quantum mechanical calculations, such as density functional and first-principles methods, because once trained they can make predictions rapidly. [1][2][3][4][5][6][7][8][9][10][11] For ML to be a successful surrogate, a method must predict properties adequately at optimized geometries, capture not just the well of the potential energy curve but also the anharmonicity that force field methods fail to capture, and appropriately handle multiple conformations of the same molecule.
Numerous studies have shown the proficiency of ML methods in predicting thermochemical parameters at already-optimized geometries, utilizing various types of representations and neural network structures. 2,12 Early representations, such as the Coulomb matrix 13 and bag-of-features, 14,15 demonstrated success in property prediction, with later representations such as FCHL 16,17 continuing to improve predictions at optimized geometries. These ML methods are typically trained on the QM7 13,18 or QM9 19,20 data sets, consisting of optimized molecules with up to 7 or 9 heavy atoms, respectively, and help to demonstrate ML's potential as a surrogate.
Additional deep neural network (DNN) methods, like ANI [3][4][5]21 and BAND NN, 22 used training data beyond optimized single points to better evaluate the potential surface for dynamics and geometry optimizations. These methods utilize the ANI-1 data set, 23 or the ANI-2 data set in the case of ANI-2x, for training, as it contains both equilibrium and non-equilibrium structures of molecules with up to eight heavy atoms (H, C, N, and O), the non-equilibrium structures being generated from normal-mode sampling. The training set for ANI-2x adds the elements F, Cl, and S while providing additional torsion-sampling data. 5 The BAND NN model uses a subset of the ANI-1 data set containing only non-equilibrium geometries with energies within 30 kcal/mol of the equilibrium energy. Although these methods have been shown to perform adequately in their respective papers, the range of bond stretches examined has been limited to the harmonic portion of the potential energy curve, rarely probing the curve further from equilibrium.
Recent work has expanded our knowledge of ML performance in predicting and ranking thermally accessible conformations. 24 Although that work did not task ML with the large bond stretches examined here, the ability of ML methods to rank conformational energies was only comparable to that of semiempirical methods. While this falls short of the accuracy of density functional (DFT) or ab initio electronic structure methods, the accuracy ML aims to replace, the performance shows promise, with future work on models and training sets expected to improve results.
For ML to become a viable replacement for current methods, it needs to achieve optimized geometries and predict properties without relying on force field (FF) methods. Most FFs have been refined for biomolecules and can struggle with non-covalent and steric interactions in applications such as conjugated polymers. While these issues can be lessened with specific parameterization, 25,26 FF geometries can generally be less than ideal. 27 ML trained on higher levels of theory should ideally capture these non-covalent interactions and provide better initial optimized geometries.
With the rapid adoption of ML, there has been a growing desire to use ML in molecular dynamics (MD) applications to provide more accurate simulations than FFs at a much lower cost than time-consuming quantum mechanical calculations. 24 For ML to be reliable, it needs to properly predict the conformational changes that occur in MD simulations, from non-equilibrium bond stretching to torsional barriers. This work examines how well current ML methods perform at these tasks, and probes the methods' understanding of chemical physics, to help determine what must be addressed for ML to improve as a surrogate for computationally expensive calculations.

Molecules
A mixture of small and large molecules was chosen to evaluate ML performance on potential energy surfaces, for a total of 17 bond stretches and 5 dihedral scans. The molecules examined were benzene (C-C and C-H stretching), methanol, methane, CO, H2, ethylene, water, acetylene, hydrogen cyanide, N2, ammonia, biphenyl, aspartame, sucrose, dialanine, and diglycine. Bond stretches were evaluated every 0.1 Å, while dihedrals were evaluated every 20°, with the exception of biphenyl, which was scanned every 15°.
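Such rigid bond scans can be sketched as follows. This is an illustrative helper, not the paper's actual scan-generation code; the `stretch_bond` function and the H2 example coordinates are our own assumptions.

```python
import numpy as np

def stretch_bond(coords, i, j, displacements):
    """Generate geometries in which atom j is displaced along the i->j
    bond axis by each value in `displacements` (angstroms), holding all
    other atoms fixed -- a rigid bond scan."""
    coords = np.asarray(coords, dtype=float)
    axis = coords[j] - coords[i]
    axis /= np.linalg.norm(axis)  # unit vector along the bond
    frames = []
    for d in displacements:
        frame = coords.copy()
        frame[j] = frame[j] + d * axis
        frames.append(frame)
    return frames

# H2 near its ~0.74 A equilibrium, scanned every 0.1 A as in the text
h2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]]
frames = stretch_bond(h2, 0, 1, np.arange(-0.3, 0.31, 0.1))
```

Each frame can then be passed to the QM reference or an ML model as a single-point evaluation.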
We also trained a deep convolutional neural network (Colorful CNN), an approach that has been used successfully in protein-ligand binding affinity prediction. 42,43 The input molecule is represented as a voxelized grid of atomic densities generated by the libmolgrid library. 44 Our network has six modules separated by pooling operations, each with seven convolutional layers. Full details of the model and training procedure are provided in the Supplementary Methods.
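A minimal numpy sketch of the voxelization idea follows. This is not libmolgrid's API: the real library also handles atom typing, multiple channels, and random rotations; the `voxelize` function, box size, resolution, and radii below are illustrative assumptions only.

```python
import numpy as np

def voxelize(coords, radii, box=8.0, resolution=0.5):
    """Place Gaussian atomic densities on a cubic grid centered at the
    origin; returns an (n, n, n) array of summed densities, mimicking
    the kind of single-channel input a grid-based CNN consumes."""
    n = int(box / resolution)
    # voxel-center coordinates along one axis
    ax = (np.arange(n) + 0.5) * resolution - box / 2
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    grid = np.zeros((n, n, n))
    for (x, y, z), r in zip(coords, radii):
        d2 = (X - x) ** 2 + (Y - y) ** 2 + (Z - z) ** 2
        grid += np.exp(-d2 / (2 * r ** 2))
    return grid

# toy water geometry (angstroms) with rough van der Waals radii
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
g = voxelize(water, radii=[1.52, 1.2, 1.2])
```

In practice each element (or atom type) would occupy its own grid channel so the network can distinguish chemistry, not just shape.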
Due to memory-scaling limitations, a subset of the ANI-1 data set was used to train the BOB/KRR and BOB/BRR models; for consistency, ECFP/RFR and BOB/RFR were also trained on this subset. The subset consists of 5 non-equilibrium geometries for every molecule with up to 7 heavy atoms, as well as 5 non-equilibrium geometries for half of the molecules with 8 heavy atoms, yielding a training set of 33,496 molecules and 167,480 non-equilibrium geometries. All molecules from the test set were removed from the training set. An additional subset of the first 5000 non-equilibrium geometries was used for FCHL/KRR; increasing the training-set size for FCHL/KRR hurt prediction performance, so our results use a model trained on 1000 different molecules for a total of 5000 non-equilibrium geometries.
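The subsetting procedure above amounts to sampling a fixed number of conformers per molecule while excluding test molecules. A hedged sketch, where the `make_subset` helper and the toy dataset layout (a dict of molecule ID to conformer list) are our own assumptions, not the paper's actual pipeline:

```python
import random

def make_subset(dataset, test_molecules, per_molecule=5, seed=0):
    """dataset: dict mapping a molecule ID to its list of non-equilibrium
    geometries. Keep `per_molecule` geometries per molecule and drop any
    molecule that appears in the test set."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    subset = {}
    for mol, geoms in dataset.items():
        if mol in test_molecules:
            continue  # no test-set leakage into training
        if len(geoms) >= per_molecule:
            subset[mol] = rng.sample(geoms, per_molecule)
    return subset

toy = {"mol_a": list(range(10)), "mol_b": list(range(8)), "benzene": list(range(12))}
sub = make_subset(toy, test_molecules={"benzene"})
```

Removing the test molecules before sampling, rather than after, guarantees no conformer of a test molecule can slip into training.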

Results and discussion
To illustrate the qualitative performance of potential energy surface predictions, we analyzed both small and larger molecules outside of the ANI-1 data set used for training for each ML method. We wish to focus on how the methods perform not only around the bond length at the energy minimum, r0, but also in the attractive and repulsive regimes, to gain a better understanding of how ML methods would behave if given less ideal starting geometries for a task such as geometry optimization. Each ML method was evaluated on the criteria demonstrated in Table 1 for bond stretches.
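Heuristic versions of such qualitative criteria (a rising repulsive wall, an attractive branch past the minimum, no spurious extra minima) can be checked automatically from a scanned curve. The function and its cutoff values below are illustrative choices of ours, not the paper's exact Table 1 definitions:

```python
import numpy as np

def qualitative_checks(r, e, wall_cut=1.0, extra_min_cut=2.0):
    """Check a bond-scan curve (r in angstroms, e in kcal/mol) for:
    energy decreasing through the short-range wall, energy rising past
    the minimum, and any extra local minima beyond `extra_min_cut` A."""
    r, e = np.asarray(r, dtype=float), np.asarray(e, dtype=float)
    de = np.diff(e)  # de[i] is the energy change over interval r[i]->r[i+1]
    # a local minimum: energy decreasing into a point, increasing out of it
    minima = [i + 1 for i in range(len(de) - 1) if de[i] < 0 < de[i + 1]]
    return {
        "repulsive_wall": bool(np.all(de[r[1:] <= wall_cut] < 0)),
        "attractive_branch": bool(np.any(de[r[:-1] >= r[np.argmin(e)]] > 0)),
        "extra_minima": [float(r[i]) for i in minima if r[i] > extra_min_cut],
    }

# a Morse-like reference curve with a single minimum near 1.1 A
r = np.arange(0.5, 3.01, 0.1)
e = (1 - np.exp(-2 * (r - 1.1))) ** 2
res = qualitative_checks(r, e)
```

A well-behaved curve passes the first two checks and returns an empty `extra_minima` list; the stepwise RFR surfaces discussed below would fail them.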
The median mean absolute percent error (MAPE) was calculated from the energy values in the range r0 ± 0.25 Å for each molecule to determine how accurate and precise the ML-predicted energies are. Additional evaluation criteria included the accuracy of the r0 prediction for each molecule, the repulsive wall, the attractive forces, and the incidence of additional minima past 2 Å.
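The MAPE-near-the-minimum metric can be sketched as below. The energy shift to a common relative scale is our assumption (ML and DFT total energies sit on different absolute scales), and the exclusion of the zero-energy minimum point avoids division by zero; this is an illustration, not the paper's exact implementation.

```python
import numpy as np

def mape_near_minimum(r, e_ref, e_pred, window=0.25):
    """Mean absolute percent error of predicted energies within
    r0 +/- `window` angstroms, where r0 is the reference minimum."""
    r, e_ref, e_pred = map(np.asarray, (r, e_ref, e_pred))
    r0 = r[np.argmin(e_ref)]
    mask = np.abs(r - r0) <= window
    # shift both curves so their minima sit at zero (relative energies)
    ref = e_ref[mask] - e_ref.min()
    pred = e_pred[mask] - e_pred.min()
    nonzero = ref != 0  # skip the minimum itself to avoid dividing by zero
    return 100.0 * np.mean(np.abs((pred[nonzero] - ref[nonzero]) / ref[nonzero]))

r = np.array([0.9, 1.0, 1.1, 1.2, 1.3])
e_ref = np.array([2.0, 0.0, 1.0, 3.0, 6.0])     # toy reference curve
e_pred = 1.1 * e_ref                            # prediction 10% high everywhere
m = mape_near_minimum(r, e_ref, e_pred)         # -> 10.0
```

The median of this value over all test molecules then gives the per-method score.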
While methods like BOB/BRR and BOB/KRR had the second- and fifth-lowest median MAPE, their ability to predict the lowest-energy geometry, a repulsive wall, and attractive forces was quite poor compared to the other top methods by MAPE. Other methods utilizing RFR also performed poorly, often predicting the stepwise energy surfaces seen in Figure 1 and thus failing to consistently predict r0 or the attractive and repulsive forces. This is seen in Figure 1b, where bond breaking causes the only change in the ECFP representation and leads to the higher energy. Other ML methods, such as ANI-1x, ANI-2x, FCHL, and Colorful CNN, were able to accurately predict energies while also capturing the repulsive and attractive forces of the molecule. In short, while random forest methods may be accurate for single-point properties, they prove inherently inaccurate for potential energy curves and should be avoided.

A possible advantage for the ANI-1x and ANI-2x models is that some molecules in our test evaluation are found in the ANI-1x training set. For the other methods, molecules in our test set were purposefully left out of the training set, but they may be present in the ANI-1x and ANI-2x models. For that reason, we will focus the remainder of our discussion on molecules outside the ANI-1 training set, examining the best overall performers from Table 1: ANI-1x, ANI-2x, FCHL, and Colorful CNN. The performance of all methods is included in the supplemental information. These results also demonstrate the need to be careful when applying ML to molecules or chemistry completely outside the scope of the training set. Figures 2c and 2d demonstrate the prediction capability of these ML methods on bond stretches for molecules larger than those in the training set.
FCHL was only able to accurately capture the shape of the potential energy curve for dialanine, failing to capture the well of the potential energy curve for aspartame, perhaps due to the difficulty of training on the entire ANI-1 set. ANI-1x, ANI-2x, and Colorful CNN retain both repulsive and attractive information while predicting energies close to those of ωB97X for both aspartame and dialanine. These methods do continue to exhibit difficulty in accurately predicting bond compression under 1 Å as well as bond stretching past 2 Å.

For bond stretches, the ANI-1x, ANI-2x, Colorful CNN, and FCHL models show promise, with initial training indicating these methods can accurately predict the bottom of the potential energy well. While force fields such as MMFF94 or GAFF can be used to obtain optimized geometries near this regime, ultimately ML methods should be accurate not only at single-point energy evaluation but across qualitatively and quantitatively accurate potential energy curves. Further training on long-range attractive forces might enable ML models to evaluate non-covalent interactions.
As an example, further evaluations were carried out on energy predictions from frozen-rotor dihedral angle scans for biphenyl and sucrose. Table 2 compiles the predicted lowest-energy angle for these molecules as well as the barrier energies from −45° to 0° for biphenyl and 0° to −60° for sucrose.
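Extracting a barrier from a frozen-rotor scan is a simple energy difference between two grid points. A minimal sketch; the scan values below are made-up illustrative numbers, not the paper's biphenyl energies:

```python
def rotation_barrier(angles, energies, a_min, a_top):
    """Barrier height between two points of a frozen-rotor scan:
    E(a_top) - E(a_min), in whatever units `energies` uses.
    Both angles must be grid points of the scan."""
    angles = list(angles)
    return energies[angles.index(a_top)] - energies[angles.index(a_min)]

# illustrative biphenyl-like scan every 15 degrees (energies in kcal/mol,
# fabricated for the example only)
angles = [-90, -75, -60, -45, -30, -15, 0]
energies = [1.6, 0.8, 0.2, 0.0, 0.4, 1.2, 2.1]
barrier = rotation_barrier(angles, energies, a_min=-45, a_top=0)
```

The same call with the sucrose grid and `a_min=0, a_top=-60` gives the second barrier reported in Table 2.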
ANI-1x and ANI-2x properly predict the lowest-energy angle for biphenyl, while Colorful CNN predicts −45° to be a local, but not global, minimum. FCHL improperly predicts the rotation energies, as seen in Figure 3a, placing the lowest-energy dihedrals at 0°, 180°, and −180°. All of the methods over-predicted the height of the energy barrier for biphenyl.
For sucrose, all four methods correctly predicted the lowest-energy angle. ANI-1x best captures the energies of the dihedral angles, as seen in Figure 3b, with ANI-2x and Colorful CNN under-predicting the energy at most angles. Unlike with biphenyl, FCHL captures the shape of the torsion scan for sucrose but vastly over-predicts the energies at each angle.

For dialanine and diglycine, predictions were also compared to those of ωB97X and the force fields MMFF94 and GAFF. ANI-1x, ANI-2x, and Colorful CNN retain the resolution of some of the higher-energy φ and ψ angles between −100° and 100°, while FCHL predicts these to be lower-energy conformations, similar to both FF methods. In lower-energy conformations, both the BAND and BOB/KRR methods over-estimate these energy differences. The additional torsion training in ANI-2x provided a beneficial reduction in the MAE for both dialanine and diglycine, seen in Figure 3, of roughly 35% relative to ANI-1x. Additional torsion sampling for Colorful CNN and FCHL should likewise decrease the MAE for predicting dihedral-angle energies, and could improve accuracy for the Colorful CNN method, which is already qualitatively adequate.
As an example, the ANI-2x training set includes additional torsion sampling, and the method shows improved accuracy over ANI-1x. Providing similar torsion-sampling training sets for other methods should yield comparable improvements.