Abstract
The effectiveness of machine learning (ML) in drug discovery hinges on evaluation and modeling approaches that align with how compounds are tested and compared in real experimental contexts. We observe that experimental data in public repositories such as ChEMBL naturally clusters by assay of origin, and that substantial overlap between training and test sets persists even under common splitting strategies. This clustering effect is notable: simply predicting the mean activity value of a training assay yields surprisingly strong performance on test compounds from the same assay. Motivated by these observations, we propose a paradigm for ML in drug discovery that respects the inherent structure of aggregated experimental data. We implement this approach through (1) data splitting that allocates entire assays to either the training or the test set, (2) evaluation metrics that assess ranking performance within individual assays rather than absolute prediction accuracy across heterogeneous experiments, and (3) set-based ranking models trained on compound sets drawn from the same assay rather than on randomly composed sets. Evaluating our approach on three datasets derived from ChEMBL, we demonstrate that ranking models trained on intra-assay sets consistently outperform both traditional IC50 prediction and ranking models trained on arbitrary compound sets. This performance advantage is most pronounced when the data is only minimally curated, suggesting that our approach effectively mitigates inconsistencies between experiments. Our findings indicate that ML methods for drug discovery should prioritize intra-assay ranking capability over absolute value prediction when working with aggregated experimental data.
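To make the first component of the approach concrete, the following is a minimal sketch of assay-wise splitting, assuming the data is held in a pandas DataFrame with a hypothetical `assay_id` column; it illustrates the idea of assigning whole assays to one split and is not the authors' implementation.

```python
# Minimal sketch of assay-wise splitting (assumption: records carry an
# "assay_id" column identifying their experiment of origin).
import numpy as np
import pandas as pd


def split_by_assay(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 0):
    """Assign entire assays to either the training or the test set."""
    rng = np.random.default_rng(seed)
    assay_ids = df["assay_id"].unique()
    rng.shuffle(assay_ids)

    # Hold out a fraction of assays, so no assay contributes to both splits.
    n_test = int(round(test_fraction * len(assay_ids)))
    test_assays = set(assay_ids[:n_test])

    test_mask = df["assay_id"].isin(test_assays)
    return df[~test_mask], df[test_mask]
```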