Abstract
Proteochemometric models (PCMs) are used in computational drug discovery to leverage both protein and ligand representations for bioactivity prediction. While machine learning (ML) and deep learning (DL) have come to dominate PCMs, often serving as scoring functions, rigorous evaluation standards have not been consistently applied. In this study, using kinase-ligand bioactivity prediction as a model system, we highlight the critical roles of dataset curation, permutation testing, class imbalances, data splitting strategies, and embedding quality in determining model performance. Our findings indicate that data splitting and class imbalances are the most critical factors affecting PCM performance, underscoring the challenges ML/DL-PCMs face in generalizing. We evaluated various protein and ligand descriptors and embeddings, including those augmented with multiple sequence alignment (MSA) information. However, permutation testing consistently demonstrated that protein embeddings contributed minimally to PCM efficacy. This study advocates for the adoption of stringent evaluation standards to enhance the generalizability of models to out-of-distribution data and to improve benchmarking practices.
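The permutation test referenced above can be illustrated with a minimal sketch: train a PCM on concatenated ligand and protein features, then retrain after shuffling the protein block across samples and compare performance. The data, feature dimensions, and random-forest model below are hypothetical stand-ins, not the study's actual pipeline.

```python
# Minimal sketch (synthetic data, hypothetical feature sizes) of a protein-embedding
# permutation test for a PCM: shuffle the protein block across samples and compare
# predictive performance against the intact model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Toy stand-ins: 1024-bit ligand fingerprints and 128-d protein embeddings.
n_samples = 2000
ligand_fp = rng.integers(0, 2, size=(n_samples, 1024)).astype(float)
protein_emb = rng.normal(size=(n_samples, 128))
y = rng.normal(size=n_samples)  # synthetic bioactivity labels

def fit_and_score(protein_block):
    """Fit a PCM on [ligand | protein] features and return held-out R^2."""
    X = np.hstack([ligand_fp, protein_block])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

r2_intact = fit_and_score(protein_emb)
r2_permuted = fit_and_score(protein_emb[rng.permutation(n_samples)])

# If the protein embedding carries signal, permuting it should degrade R^2;
# a negligible drop suggests the ligand descriptors dominate the prediction.
print(f"intact R2: {r2_intact:.3f}  permuted R2: {r2_permuted:.3f}")
```

The same comparison can be repeated under different splitting strategies (e.g., random versus protein- or scaffold-based splits) to separate the contribution of the protein representation from memorization effects.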
Supplementary materials
Supplementary information and figures: dataset curation, baseline model dataset, hyperparameter tuning, and implementation of the convolutional autoencoder.
Supplementary tables: raw statistical data analysis and ANOVA results.