Abstract
Virtual Screening (VS) of large compound libraries using Artificial Intelligence (AI) models is a highly effective approach for early drug discovery. Data splitting is crucial for benchmarking the performance of such AI models. Traditional random data splits often place structurally similar molecules in both training and test sets, which conflicts with the reality of VS libraries that typically contain structurally diverse compounds. To tackle this challenge, the scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by chemotype, have long been used. However, we show that these methods still introduce high similarity between training and test sets, leading to overestimated model performance. Our study examined four representative AI models across 60 NCI-60 datasets, each comprising approximately 33,000 to 54,000 molecules tested against a different cancer cell line. Each dataset was split in four ways: random, scaffold, Butina clustering, and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Using Linear Regression, Random Forest, Transformer-CNN, and GEM, we trained a total of 8,400 models and evaluated them under the four splitting methods. These comprehensive results show that the UMAP split provides the most challenging and realistic benchmarks for model evaluation, followed by the Butina split, then the scaffold split, with the random split close behind. Consequently, we recommend using UMAP splits instead of the overly optimistic Butina splits, and especially instead of scaffold splits, for molecular property prediction, including VS. Lastly, we illustrate how misaligned ROC AUC is with VS goals, despite its common use. The code and datasets for reproducibility are available at https://github.com/Rong830/UMAP_split_for_VS and archived at https://zenodo.org/records/14736486.
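For concreteness, the sketch below shows one way a UMAP-based split can be implemented: embed Morgan fingerprints with UMAP, cluster the low-dimensional embedding, and hold out whole clusters as the test set so that no chemotype straddles the split. The function name `umap_cluster_split`, the fingerprint settings, and the use of KMeans here are illustrative assumptions rather than the authors' exact pipeline; the repository linked above contains the actual implementation.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
import umap  # pip install umap-learn


def umap_cluster_split(smiles, test_frac=0.2, n_clusters=100, seed=42):
    """Assign whole UMAP-space clusters to train or test, so structurally
    related molecules never appear on both sides of the split.
    Assumes valid SMILES and len(smiles) >= n_clusters."""
    # Featurize: 2048-bit Morgan fingerprints (radius 2), one row per molecule.
    fps = []
    for smi in smiles:
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), 2, nBits=2048)
        arr = np.zeros((2048,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    X = np.vstack(fps)

    # Non-linear projection to a low-dimensional space, then cluster there.
    embedding = umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embedding)

    # Hold out whole clusters until ~test_frac of molecules are in the test set.
    rng = np.random.default_rng(seed)
    target = int(test_frac * len(smiles))
    test_idx = []
    for c in rng.permutation(n_clusters):
        test_idx.extend(np.where(labels == c)[0].tolist())
        if len(test_idx) >= target:
            break
    test_set = set(test_idx)
    train_idx = [i for i in range(len(smiles)) if i not in test_set]
    return train_idx, sorted(test_idx)
```

Because entire clusters are held out, the test-set chemotypes are unseen during training, which is what makes this kind of split more demanding than random or scaffold splits.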