High-throughput screening (HTS) is one of the leading techniques for hit identification in drug discovery and is typically conducted in two phases: primary and confirmatory. The resulting data is multi-fidelity, with noisy primary screening data available for a large number of compounds and higher-quality confirmatory data for a low-to-moderate number of compounds. Existing computational pipelines do not integrate the primary and confirmatory screening data of individual HTS campaigns, leaving millions of potentially useful screening data points unused in models of confirmatory bioactivity prediction. Furthermore, there is currently a lack of publicly available multi-fidelity bioactivity benchmarks to support modelling of real-world high-throughput screening data. To address these challenges, we first compiled a public collection of 23 multi-fidelity HTS datasets from PubChem for benchmarking, comprising more than 6.1 million data points. Additionally, we assembled a private collection of 19 AstraZeneca HTS datasets, spanning more than 22.8 million data points. We then designed and evaluated machine learning models to assess the improvements offered by the integration of multi-fidelity data, including classical machine learning and novel deep learning approaches, the latter based on graph neural networks. Jointly modelling primary and confirmatory data led to a 12% decrease in mean absolute error (MAE) and a 152% increase in R-squared on the public datasets, and a 17% reduction in MAE coupled with a 46% uplift in R-squared on the AstraZeneca datasets (averaged across all evaluated methods). Furthermore, supplementing the models with molecular embeddings produced by previously trained deep learning models improved performance for compounds that were not part of the primary screen, up to doubling the baseline performance.
We conclude that joint modelling of multi-fidelity HTS data improves predictive performance, and that deep learning enables unique and highly desirable strategies such as leveraging signals from multi-million-scale datasets and transfer learning.