UMAP-clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines

Qianrong Guo; Saiveth Hernandez; Pedro Ballester

doi:10.26434/chemrxiv-2024-f1v2v

Theoretical and Computational Chemistry


10 December 2024, Version 1

This is not the most recent version. There is a newer version of this content available.

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits place similar molecules in both the training and test sets, which conflicts with the reality of VS libraries, which mostly contain structurally distinct compounds. To tackle this challenge, scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by their chemotypes, were proposed. In the present study, however, we show that such splitting methods still introduce high similarities between clusters, leading to overestimated model performance. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 33,000 to 54,000 molecules tested on a different cancer cell line. Each dataset was split with four methods: random, scaffold, Butina clustering and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, the results from the 300 models trained and evaluated for each algorithm and split show that performance is much worse with UMAP splits. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. The rigorous UMAP-clustering splits reveal that model generalization remains a gap when the splitting method changes. The code to reproduce these results is available at https://github.com/Rong830/UMAP_split_for_VS
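Scaffold, Butina and UMAP-clustering splits all share one step: molecules are first grouped, and whole groups are then assigned to either the training or the test set, so that no group spans both. The following is a minimal sketch of that assignment step only, not the authors' implementation (see their repository for that). It assumes cluster labels have already been computed by some upstream method, for example by clustering a UMAP embedding of molecular fingerprints; the function name `cluster_split`, its parameters and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def cluster_split(cluster_labels, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both sets.

    cluster_labels: one label per molecule (assumed precomputed upstream).
    Returns two sorted lists of molecule indices: (train_idx, test_idx).
    """
    # Group molecule indices by their cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)

    # Shuffle the cluster order reproducibly.
    labels = sorted(clusters)
    random.Random(seed).shuffle(labels)

    # Fill the test set cluster by cluster until it reaches the target size;
    # all remaining clusters go to the training set.
    target = test_fraction * len(cluster_labels)
    train_idx, test_idx = [], []
    for label in labels:
        if len(test_idx) < target:
            test_idx.extend(clusters[label])
        else:
            train_idx.extend(clusters[label])
    return sorted(train_idx), sorted(test_idx)


# Toy example: 10 molecules in 4 hypothetical clusters.
labels = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
train_idx, test_idx = cluster_split(labels, test_fraction=0.3, seed=42)
```

Because clusters are assigned whole, the test set can overshoot the requested fraction by up to one cluster. How dissimilar the two sets end up depends entirely on how the cluster labels were produced, which is the point the abstract makes about the different splitting methods.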

Keywords

Artificial Intelligence

Virtual Screening

Benchmarking

QSAR

Molecular Property Prediction


Version History

May 15, 2025 Version 2

Dec 10, 2024 Version 1

Metrics

Views: 963

Downloads: 485

License

The content is available under CC BY NC 4.0


Funding

The Royal Society

Wolfson Foundation

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
