Enhancing Semiempirical Quantum Mechanical Scoring with Machine Learning: a new scoring function that accounts for both the enthalpic and entropic contributions to the ligand binding free energy

23 December 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Identifying hit compounds is a principal step in early-stage drug discovery. While many machine learning (ML) approaches have been proposed, in the absence of binding data molecular docking remains the most widely used option for predicting binding modes and scoring hundreds of thousands of compounds for binding affinity to the target protein. Docking's effectiveness depends critically on the protein-ligand (P-L) scoring function (SF); thus, re-scoring with more rigorous SFs is common practice. In this pilot study, we scrutinize the PM6-D3H4X/COSMO semi-empirical quantum mechanical (SQM) method as a docking pose re-scoring tool on 17 diverse receptors and ligand decoy sets, totaling 1.5 million P-L complexes. We investigate the effect of explicitly computed ligand conformational entropy and ligand deformation energy on SQM P-L scoring in a virtual screening (VS) setting, as well as molecular mechanics (MM) versus hybrid SQM/MM structure optimization prior to re-scoring. Our results indicate that there is no obvious benefit from computing ligand conformational entropies or deformation energies, and that optimizing only the ligand's geometry at the SQM level is sufficient to achieve the best possible scores. Instead, we leverage ML to implicitly add the missing entropy terms to the SQM score using ligand topology, physicochemical, and P-L interaction descriptors. Our new hybrid scoring function, named SQM-ML, is transparent and explainable, and achieves on average 9% higher AUC-ROC than PM6-D3H4X/COSMO and 3% higher than Glide SP. It does so with consistent and predictable performance across all test sets, unlike those two SFs, whose performance is considerably target-dependent and sometimes resembles that of a random classifier. The code to prepare and train SQM-ML models is available at https://github.com/tevang/sqm-ml.git, and we believe it will pave the way for a new generation of hybrid SQM/ML protein-ligand scoring functions.
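To make the AUC-ROC comparison concrete, the following is a minimal sketch of how the metric can be computed for a set of scored actives and decoys, using the rank-sum (Mann-Whitney) identity. It is not taken from the SQM-ML repository; the function name `auc_roc` and the toy energies and labels are hypothetical and serve only as illustration. Because SQM scores are energies (lower means stronger predicted binding), the example negates them so that a higher value means "more likely active". An AUC-ROC of 0.5 corresponds to a random classifier.

```python
def auc_roc(scores, labels):
    """AUC-ROC from the rank-sum identity; tied scores get average ranks.

    scores: higher value = more likely active.
    labels: 1 for active, 0 for decoy.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one decoy")

    # Sort indices by ascending score and assign 1-based average ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Mann-Whitney U for the actives, normalized by the number of pairs.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy illustration (made-up values, not data from the study).
toy_energies = [-60.0, -45.0, -50.0, -30.0, -20.0]  # hypothetical scores
toy_labels = [1, 1, 0, 0, 0]                        # 1 = active, 0 = decoy
toy_auc = auc_roc([-e for e in toy_energies], toy_labels)
```

In this toy set, five of the six active-decoy pairs are ranked correctly, so `toy_auc` is 5/6.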


drug design
quantum chemistry
machine learning
hit discovery
virtual screening

