Combined physics- and machine-learning-based method to identify druggable binding sites using SILCS-Hotspots

20 August 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Identifying druggable binding sites on proteins is an important and challenging problem, particularly for cryptic, allosteric binding sites that may not be obvious from X-ray, cryo-EM, or predicted structures. The Site-Identification by Ligand Competitive Saturation (SILCS) method accounts for the flexibility of the target protein using all-atom molecular simulations that include various small-molecule solutes in aqueous solution. During the simulations, the combination of protein flexibility and comprehensive sampling of the water and solute spatial distributions can identify buried binding pockets that are absent in experimentally determined structures. Previously, we reported a method for leveraging the information in the SILCS sampling to identify binding sites (termed Hotspots) of small mono- or bicyclic compounds, a subset of which coincide with known binding sites of drug-like molecules. Here we build on that physics-based approach and present a machine learning (ML) model for ranking the Hotspots according to the likelihood that they can accommodate drug-like molecules (e.g., molecular weight > 200 daltons). In the independent validation set, which includes various enzymes and receptors, our model recalls 67% and 89% of experimentally validated ligand binding sites in the top 10 and top 20 ranked Hotspots, respectively. Furthermore, we show that the model’s output Decision Function is a useful metric for predicting binding sites and their potential druggability in new targets. Given the utility of the SILCS method for ligand discovery and optimization, the tools presented represent an important advancement in the identification of orthosteric and allosteric binding sites and the discovery of drug-like molecules targeting those sites.
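
The recall figures quoted above depend on how a ranked list of Hotspots is compared with known sites. The following is a minimal sketch of one plausible way to compute such a top-k recall, not the authors' implementation. It assumes that each Hotspot carries a score (for example, a classifier decision-function value) and an optional label naming the known ligand binding site it coincides with. The function name `recall_at_k`, the tuple layout, and the toy scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# One Hotspot = (decision_score, site_id). site_id is None when the Hotspot
# does not coincide with any known ligand binding site. How a Hotspot is
# matched to a site (e.g. a distance cutoff) is assumed to be decided upstream.
Hotspot = Tuple[float, Optional[str]]


def recall_at_k(hotspots: List[Hotspot], k: int) -> float:
    """Fraction of known sites matched by at least one Hotspot in the top k."""
    all_sites = {site for _, site in hotspots if site is not None}
    if not all_sites:
        raise ValueError("no known binding sites among the Hotspots")
    # Higher score = ranked earlier (assumption: larger values are more site-like).
    ranked = sorted(hotspots, key=lambda h: h[0], reverse=True)
    found = {site for _, site in ranked[:k] if site is not None}
    return len(found) / len(all_sites)


# Made-up scores for a single hypothetical protein with three known sites.
toy_hotspots: List[Hotspot] = [
    (1.8, "siteA"),
    (0.9, None),
    (0.4, None),
    (0.1, "siteB"),
    (-0.7, None),
    (-1.2, "siteC"),
]

if __name__ == "__main__":
    for k in (1, 4, 6):
        print(k, recall_at_k(toy_hotspots, k))
```

In this sketch a site counts as recalled once any Hotspot matched to it appears in the top k, so several Hotspots mapping to the same site are not counted twice. Whether the reported 67% and 89% are pooled over all proteins or averaged per protein is not stated in the abstract, and the sketch takes no position on that.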

Keywords

Site identification by ligand competitive saturation
protein-ligand interaction
orthosteric
allosteric
computer-aided drug design
CADD
binding site prediction

Supplementary materials

Title: Supporting information
Description:
Figure S1: Surface-exposed Hotspot 25 in ERK5.
Figure S2: Distribution of Hotspot SASA by protein system.
Figure S3: Analysis of the recursive feature elimination and the top two principal components (PCs) of the training set.
Figure S4: Ranking based on mean LGFE of each Hotspot.
Figure S5: Burial of allosteric binding site between GABABR Active TM domains.
Figure S6: CryptoSite predictions for NKG2D (A) and TEM-1 (B).
Table S1: List of proteins and ligands used for methods validation.
Table S2: Training and validation set Hotspots and ligand distances.
Table S3: Stratified 5-fold cross-validation training of higher-order SVM classifiers with polynomial or radial basis function kernels and a Random Forest model.
Table S4: FDA compound screening for selected Hotspots of TEM-1 and GABABR Active.
