Data Valuation: A novel approach for analyzing high throughput screen data using machine learning

12 December 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

In the rapidly evolving field of drug discovery, High Throughput Screening (HTS) is a pivotal technique for identifying promising compounds. Despite its wide use, the primary challenge remains efficiently sifting through vast chemical libraries to distinguish true bioactive compounds from false positives. This study introduces a novel application of data valuation methods in machine learning to address this challenge, offering a multi-faceted approach to improving drug discovery pipelines. Our strategy encompasses enhancing active learning for efficient compound library screening, robust identification of false and true positives in primary HTS data, and optimizing HTS datasets for machine learning applications through targeted undersampling. We demonstrate that influence-based methods enable more effective batch screening of chemical libraries, reducing the need for extensive HTS, and provide significant advancements over current false positive detection techniques. This is achieved by employing machine learning models that accurately distinguish between true biological activity and assay artifacts, thereby streamlining the drug discovery process. Furthermore, our method applies smart undersampling to balance HTS datasets, enhancing the performance of machine learning algorithms without the risk of omitting crucial inactive samples. The implications of these developments are far-reaching, offering a potential paradigm shift in the efficiency and accuracy of drug development processes. We provide a benchmarking platform to facilitate the application of these methods, ensuring easy integration and modification for a broad range of datasets, thus propelling the scientific community towards more effective drug discovery methodologies (available on GitHub at https://github.com/JoshuaHesse/DataValuationPlatform).
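
To make the idea of data valuation for false positive detection concrete, the following is a minimal sketch, not the authors' implementation. It substitutes plain leave-one-out valuation with a nearest-neighbour classifier for the influence-based methods mentioned in the abstract. All function names, the two-dimensional "descriptors", and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=1):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda point: sum((a - b) ** 2 for a, b in zip(point[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k=1):
    """Fraction of validation compounds whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def leave_one_out_values(train, validation, k=1):
    """Value of a sample = accuracy with it minus accuracy without it.

    Positive values mark helpful samples; negative values mark samples
    that hurt the model, such as likely label errors.
    """
    base = accuracy(train, validation, k)
    return [
        base - accuracy(train[:i] + train[i + 1:], validation, k)
        for i in range(len(train))
    ]


# Hypothetical toy data: (descriptor vector, label), 1 = active, 0 = inactive.
train = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),  # inactives
    ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1),  # true actives
    ((0.5, 0.5), 1),  # assumed assay artifact: labeled active, sits among inactives
]
# Hypothetical confirmed labels, e.g. from a follow-up screen.
validation = [
    ((0.4, 0.4), 0), ((0.6, 0.6), 0), ((0.1, 0.1), 0), ((5.5, 5.5), 1),
]

values = leave_one_out_values(train, validation)

# Primary actives with negative value are candidate false positives.
flagged = [i for i, v in enumerate(values) if v < 0 and train[i][1] == 1]
print(values)
print(flagged)
```

In this toy setting the artifact is the only sample with a negative value, so it is the only one flagged. Leave-one-out retraining scales poorly to real HTS libraries, which is one motivation for cheaper influence estimates; the same per-sample scores could in principle also guide which inactive samples to keep when undersampling.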

Keywords

Machine Learning
Data Valuation
High Throughput Screening
Active Learning
False Positive Prediction
Undersampling

Supplementary materials

Supporting Information
Supporting information for the manuscript, containing more in-depth information about data curation, as well as additional information and data for the experiments.
