3D Convolutional Neural Networks and a CrossDocked Dataset for Structure-Based Drug Design

Paul Francoeur; Tomohide Masuda; David R. Koes

doi:10.26434/chemrxiv.11833323.v2

Biological and Medicinal Chemistry


04 March 2020, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and no standard dataset of sufficient size exists for comparing performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of training and test data can affect the results of models trained with the PDBbind dataset, how performance improves when more, lower-quality training data are added, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best-performing model, an ensemble of five densely connected convolutional networks, achieves a root mean squared error of 1.42 and a Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. To facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
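As an illustration of two ideas the abstract relies on, the sketch below shows one way a clustered cross-validation split could be built and how the two reported affinity metrics (root mean squared error and Pearson R) are computed. It is a minimal sketch, not the authors' pipeline: the cluster labels, the greedy fold assignment, and the function names `clustered_folds`, `rmse`, and `pearson_r` are all assumptions made for the example, and the abstract does not specify how the clusters are derived.

```python
import math
from collections import defaultdict


def clustered_folds(cluster_of, n_folds=3):
    """Assign samples to folds so that a cluster never spans two folds.

    cluster_of: dict mapping a sample id to a cluster id (hypothetical
    labels; how clusters are defined is outside this sketch).
    Returns a dict mapping each sample id to a fold index.
    """
    members = defaultdict(list)
    for sample, cluster in cluster_of.items():
        members[cluster].append(sample)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Greedy balancing (an assumption): place the largest clusters first,
    # each into whichever fold currently holds the fewest samples.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        fold = fold_sizes.index(min(fold_sizes))
        for sample in members[cluster]:
            fold_of[sample] = fold
        fold_sizes[fold] += len(members[cluster])
    return fold_of


def rmse(pred, true):
    """Root mean squared error between predicted and true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson_r(pred, true):
    """Pearson correlation coefficient between predicted and true values."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


# Toy data: five poses grouped into three made-up pocket clusters.
cluster_of = {
    "poseA": "pocket1", "poseB": "pocket1",
    "poseC": "pocket2",
    "poseD": "pocket3", "poseE": "pocket3",
}
folds = clustered_folds(cluster_of, n_folds=2)
```

Because whole clusters are held out together, a model evaluated on one fold has not seen any member of that fold's clusters during training, which is the property that makes the evaluation less optimistic than a random split.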

Keywords

Protein-Ligand Binding Affinity

Structure Based Drug Design

Supplementary materials

crossdocked2020 supplement

Now Published

Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design

Paul G. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B. Iovanisci, Ian Snyder, David R. Koes

Journal of Chemical Information and Modeling, Volume 60, Issue 9

Online publication date: Aug 31, 2020

Version History

Mar 04, 2020 Version 2

Feb 21, 2020 Version 1

Version Notes

version 1.1 -- added supplemental pdf.

Metrics

Views: 9,194

Downloads: 3,755

License

The content is available under CC BY NC ND 4.0

Funding

National Institute of General Medical Sciences

R01GM108340

Methods, Tools and Resources for Interactive Online Virtual Screening and Lead Optimization

https://app.dimensions.ai/details/grant/grant.2522151

TG-MCB190049

ACI-1548562

Author’s competing interest statement

None
