Theoretical and Computational Chemistry

Machine learned calibrations to high-throughput molecular excited state calculations



Understanding the excited state properties of molecules provides insights into how they interact with light. These interactions can be exploited to design compounds for photochemical applications, including enhanced spectral conversion of light to increase the efficiency of photovoltaic cells. While chemical discovery is time- and resource-intensive experimentally, computational chemistry can be used to screen large-scale databases for molecules of interest in a procedure known as high-throughput virtual screening. The first step usually involves a high-speed but low-accuracy method to screen large numbers of molecules (potentially millions) so that only the best candidates are evaluated with expensive methods. However, a coarse first-pass screening method can result in high false positive or false negative rates. Therefore, this study uses machine learning to calibrate a high-throughput technique (xTB-sTDA) against a higher-accuracy one (TD-DFT). Testing the calibration model shows a ~6-fold decrease in error in-domain and a ~3-fold decrease out-of-domain. The resulting mean absolute error of ~0.14 eV is in line with previous work on machine learning calibrations and outperforms previous work on linear calibration of xTB-sTDA. We then apply the calibration model to screen a 250k molecule database and map inaccuracies of xTB-sTDA in chemical space. We also show the generalizability of the workflow by calibrating against a higher-level technique (CC2), yielding a similarly low error. Overall, this work demonstrates that machine learning can be used to develop a method that is both cheap and accurate for large-scale excited state screening, enabling accelerated molecular discovery across a variety of disciplines.
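The calibration idea described above can be illustrated with a minimal sketch. It is not the model used in the paper: it assumes a single hypothetical descriptor per molecule, synthetic energies in place of xTB-sTDA and TD-DFT values, and a simple least-squares line standing in for the machine learning model. The correction is learned as the difference between the high-level and low-level energies and then added back to the low-level value.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mae(pred, ref):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

random.seed(0)

# Synthetic stand-ins (assumptions, not data from the paper):
#   desc   - one hypothetical molecular descriptor per molecule
#   e_high - "high-level" excitation energy in eV
#   e_low  - "low-level" energy with a descriptor-dependent systematic error
n = 200
desc = [random.uniform(0.0, 1.0) for _ in range(n)]
e_high = [2.0 + 3.0 * x + random.gauss(0.0, 0.05) for x in desc]
e_low = [eh + 0.4 + 0.5 * x + random.gauss(0.0, 0.05)
         for eh, x in zip(e_high, desc)]

# Train/test split.
n_train = 150
x_tr, x_te = desc[:n_train], desc[n_train:]
lo_tr, lo_te = e_low[:n_train], e_low[n_train:]
hi_tr, hi_te = e_high[:n_train], e_high[n_train:]

# Learn the correction (high minus low) as a function of the descriptor.
delta_tr = [h - l for h, l in zip(hi_tr, lo_tr)]
a, b = fit_line(x_tr, delta_tr)

# Calibrated prediction = low-level value + learned correction.
e_cal = [l + (a * x + b) for l, x in zip(lo_te, x_te)]

mae_raw = mae(lo_te, hi_te)
mae_cal = mae(e_cal, hi_te)
print(f"MAE before calibration: {mae_raw:.3f} eV")
print(f"MAE after calibration:  {mae_cal:.3f} eV")
```

Learning the difference rather than the high-level energy itself means the model only has to capture the systematic error of the cheap method, which is typically a smoother target than the energy.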

Version notes

We have added various details to make the paper clearer. We have included references to the delta-ML approach the paper was based on. We re-made Figure 2 to include all training and test datasets. We included additional cross-validation results for the training datasets. We included a comparison of direct vs. delta ML models for the 300k training set. We included additional analysis for the HTVS results section, and for the xTB-sTDA error subsection. Finally, we added additional ML analysis to the CC2 results section, including a transfer learning model and a B3LYP to CC2 calibration model, to help improve the accuracy of xTB calibrations.


Supplementary material

Supplementary Information for "Machine learned calibrations to high-throughput molecular excited state calculations"
Supplementary information includes chemical information about the training sets, TD-DFT settings, the ML model architecture, additional plots of dataset calibration, additional details about active learning, further analysis of the high-throughput screening results, and additional substructure analysis of xTB-sTDA error categories.

Supplementary weblinks

xTB-ML data
Data repository for paper. Includes raw data and trained ML models.
xTB-ML workflow
Code repository for paper. Includes scripts to run TD-DFT, xTB, and train/test ML models.