MF-PCBA: Multi-fidelity high-throughput screening benchmarks for drug discovery and machine learning

David Buterez; Jon Paul Janet; Steven J. Kiddle; Pietro Liò

doi:10.26434/chemrxiv-2022-cb3tz

Theoretical and Computational Chemistry

14 November 2022, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

High-throughput screening (HTS) is one of the key techniques in drug discovery and is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, potentially leading to improved drug activity predictions and cheaper, more effective experimental design. However, existing collections of machine learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of 'noisy' activity values from primary screening, is effectively ignored in the majority of machine learning models of HTS data. To address these limitations, we introduce MF-PCBA (Multi Fidelity PubChem BioAssay), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multi-fidelity. Multi-fidelity data accurately reflects real-world HTS conventions and presents a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here, we detail the steps taken to assemble MF-PCBA, in terms of data acquisition from PubChem and the filtering steps required to curate the raw data. We also evaluate a recent deep learning-based method for multi-fidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and discuss the results in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule-protein interactions. The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba.

Keywords

high-throughput screening

concentration response

artificial intelligence

computational

graph neural network

gnn

graph representation learning

support vector machine

Supplementary materials

Title

Supplementary Information

Description

Additional figures and tables supporting the presentation and analysis in the main text.

Now Published

MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning

David Buterez, Jon Paul Janet, Steven J. Kiddle, Pietro Liò (journal article)

Journal of Chemical Information and Modeling, Volume 63, Issue 9

Online publication date: Apr 14, 2023

Version History

Nov 14, 2022 Version 1

Metrics

Views: 1,304

Downloads: 645

License

The content is available under CC BY 4.0

Funding

AstraZeneca

Author’s competing interest statement

DB's doctoral studies are funded by AstraZeneca. JPJ and SJK are employed by AstraZeneca and potentially hold shares in the company.

Ethics

The author(s) have declared that ethics committee/IRB approval is not relevant to this content.
