Machine Learning-Boosted Docking Enables the Efficient Structure-Based Virtual Screening of Giga-Scale Enumerated Chemical Libraries

07 August 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The emergence of ultra-large screening libraries, filled to the brim with billions of readily available compounds, poses a growing challenge for docking-based virtual screening. Machine Learning (ML)-boosted strategies like the tool HASTEN combine rapid ML prediction with the brute-force docking of small fractions of such libraries to increase screening throughput and take on giga-scale libraries. In our case study of an anti-bacterial chaperone and an anti-viral kinase, we first generated a brute-force docking baseline for 1.56 billion compounds in the Enamine REAL lead-like library with the fast Glide HTVS protocol. With HASTEN, we observed robust recall of 90% of the true 1000 top-scoring virtual hits for both targets when docking only 1% of the entire library. This reduction of the required docking experiments by 99% significantly shortens the screening time. In the kinase target, the use of a hydrogen bonding constraint resulted in a major proportion of unsuccessful docking attempts and hampered ML predictions. We demonstrate the optimization potential in the treatment of failed compounds when performing ML-boosted screening, and we benchmark and showcase HASTEN as a fast and robust tool in a growing arsenal of approaches to unlock the chemical space covered by giga-scale screening libraries for everyday drug discovery campaigns.
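
To make the recall figure concrete, the following minimal sketch shows one way such a number could be computed. It is an illustration under stated assumptions, not the HASTEN implementation: the function name `top_k_recall`, the compound identifiers, and the toy scores are all hypothetical, and it assumes that more negative docking scores are better. It also works out how many compounds 1% of a 1.56 billion compound library corresponds to.

```python
def top_k_recall(baseline_scores, screened_ids, k=1000):
    """Fraction of the k best baseline compounds that were also screened.

    baseline_scores: dict mapping compound ID -> brute-force docking score
    screened_ids:    IDs of the compounds actually docked in the boosted run
    Assumption: lower (more negative) scores are better.
    """
    # Rank all baseline compounds from best (most negative) to worst.
    ranked = sorted(baseline_scores, key=baseline_scores.get)
    true_top = set(ranked[:k])
    # Recall = share of the true top-k that the boosted run recovered.
    return len(true_top & set(screened_ids)) / len(true_top)


# Toy baseline (hypothetical data): cmpd0 has the best score, cmpd99 the worst.
baseline = {f"cmpd{i}": -float(100 - i) for i in range(100)}
# Hypothetical set of compounds docked in the ML-boosted run.
screened = ["cmpd0", "cmpd1", "cmpd2", "cmpd50"]

recall = top_k_recall(baseline, screened, k=4)

# Docking 1% of a 1.56 billion compound library (integer arithmetic).
library_size = 1_560_000_000
docked = library_size // 100

print(recall)  # 0.75: three of the four best compounds were recovered
print(docked)  # 15600000
```

In this toy case three of the four true top-scoring compounds are among the docked ones, giving a recall of 0.75; the abstract's reported figure corresponds to a value of about 0.9 with k = 1000.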

Keywords

Machine Learning
Virtual screening
ultra-large scale docking
SurA
GAK
HASTEN
Glide
Chemprop
Drug discovery

Supplementary materials

Title: Supporting Information
Description: Supporting Figures S1-S10. Supporting Tables S1-S8. Summary of utilized Chemprop parameters. Extended methodology: GAK receptor selection and docking method validation.

Title: GAK lead-like actives used for method validation as obtained from ChEMBL
Description: This spreadsheet contains identifiers, SMILES and activity data of GAK actives (defined as IC50, Kd or Ki of at least 1 µM) as obtained from ChEMBL (15/12/2022) alongside the corresponding ChEMBL assay ID and the original source DOI. These compounds were used for docking method validation in the manuscript.
