A Very Large-Scale Bioactivity Comparison of Deep Learning and Multiple Machine Learning Algorithms for Drug Discovery

Thomas R. Lane; Daniel H. Foil; Eni Minerali; Fabio Urbina; Kimberley M. Zorn; Sean Ekins

doi:10.26434/chemrxiv.12781241.v1

Theoretical and Computational Chemistry

10 August 2020, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies we have applied multiple machine learning algorithms and modeling metrics, and in some cases compared molecular descriptors, to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL to evaluate machine learning methods of interest to them. The largest of these studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and compared our proprietary software Assay Central™ with random forest, k-Nearest Neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (3 levels). Model performance was assessed using an array of five-fold cross-validation metrics including area-under-the-curve, F1 score, Cohen’s kappa and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central™ and support vector classification were comparable. Unlike prior studies, which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case, where minimal tuning was performed for any of the methods. If anything, Assay Central™ may have been at a slight advantage, as the activity cutoff for each of the over 5000 datasets, representing over 570,000 unique compounds, was based on Assay Central™ performance, but support vector classification seems to be a strong competitor. We also apply Assay Central™ to prospective predictions for PXR and hERG to further validate these models. This work currently appears to be the largest comparison of machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors and algorithms, as well as further refining methods for evaluating and comparing models.
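
As an illustration of the metrics named in the abstract, the following is a minimal sketch of how F1 score, Cohen’s kappa, Matthews correlation coefficient and area-under-the-curve can be computed for a single binary classification fold. It is not the authors' pipeline: the function names `confusion_metrics` and `roc_auc`, the example counts and the example scores are all hypothetical, and the formulas are the standard textbook definitions.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return F1, Cohen's kappa and MCC from binary confusion-matrix counts.

    Hypothetical helper for illustration; uses the standard definitions.
    """
    n = tp + fp + tn + fn

    # F1: harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance != 1 else 0.0)

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"f1": f1, "kappa": kappa, "mcc": mcc}


def roc_auc(active_scores, inactive_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    active outscores a randomly chosen inactive (ties count as one half)."""
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical fold: 40 true positives, 10 false positives,
# 40 true negatives, 10 false negatives.
fold = confusion_metrics(tp=40, fp=10, tn=40, fn=10)

# Hypothetical predicted scores for actives and inactives in the same fold.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

In a five-fold cross-validation these values would be computed per fold (or on the pooled out-of-fold predictions) and then summarized; how the study aggregated and normalized them is described in the full paper, not here.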

Supplementary materials

Assay Central Mega final suppl material

Assay Central Mega final

License

The content is available under CC BY-NC-ND 4.0

Funding

R44GM122196-02A1

3R43AT010585-01S1

1R43ES031038-01

Author’s competing interest statement

S.E., D.H.F., E.M., K.M.Z., and T.R.L. work for Collaborations Pharmaceuticals, Inc. F.U. has no conflicts of interest.
