Abstract
High-throughput screening (HTS) is a pivotal technique for identifying promising compounds in drug discovery. Despite its wide use, the primary challenge remains sifting efficiently through vast chemical libraries to distinguish true bioactive compounds from false positives. This study introduces a novel application of machine learning data valuation methods to this challenge, addressing three aspects of the drug discovery pipeline: enhancing active learning for efficient compound library screening, robustly identifying false and true positives in primary HTS data, and optimizing HTS datasets for machine learning through targeted undersampling. We demonstrate that influence-based methods enable more effective batch screening of chemical libraries, thereby reducing the need for extensive HTS, and that they significantly improve on current false positive detection techniques. This is achieved by employing machine learning models that accurately distinguish true biological activity from assay artifacts. Furthermore, our method applies smart undersampling to balance HTS datasets, enhancing the performance of machine learning algorithms without the risk of omitting crucial inactive samples. These developments could substantially improve the efficiency and accuracy of drug development. We provide a benchmarking platform that facilitates the application of these methods and allows easy integration and modification for a broad range of datasets (available on GitHub at https://github.com/JoshuaHesse/DataValuationPlatform).
Supplementary materials
Title
Supporting Information
Description
Supporting information for the manuscript, containing more in-depth information about data curation as well as additional information and data for the experiments.
Supplementary weblinks
Title
DataValuationPlatform GitHub repository
Description
GitHub repository for the DataValuationPlatform, created to allow easy application of the discussed data valuation methods to high-throughput screening data.