Abstract
Protein-ligand binding affinity assessment plays a pivotal role in virtual drug screening, yet conventional data-driven approaches rely heavily on the limited number of available protein-ligand crystal structures. Structure-free compound-protein interaction (CPI) methods have emerged as competitive alternatives, leveraging extensive bioactivity data to serve as more robust scoring functions. However, these methods often overlook two critical challenges that affect data efficiency and modeling accuracy: the heterogeneity of bioactivity data caused by differences in bioassay measurements, and the presence of activity cliffs (ACs), in which small chemical modifications lead to significant changes in bioactivity. ACs have not been thoroughly investigated in CPI modeling. To address these challenges, we present CPI2M, a large-scale CPI benchmark dataset containing approximately 2 million bioactivity endpoints across four activity types (Ki, Kd, EC50, and IC50) with AC annotations. We also develop GGAP-CPI, a structure-free deep learning model designed to mitigate the impact of ACs on CPI prediction through advanced protein representation modelling and integrated bioactivity learning. Our comprehensive evaluation demonstrates that GGAP-CPI outperforms 12 target-specific and 7 general CPI baselines across four key scenarios (general CPI prediction, rare protein prediction, transfer learning, and virtual screening) on seven benchmarks (CPI2M, MoleculeACE, CASF-2016, MerckFEP, DUD-E, DEKOIS-v2, and LIT-PCBA). Furthermore, GGAP-CPI not only delivers stable predictions by distinguishing bioactivity differences between ACs and non-ACs, but also enriches binding pocket residues and interactions, underscoring its applicability to real-world binding affinity assessments and virtual drug screening.
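To make the notion of an activity cliff concrete, the following is a minimal sketch of how AC pairs might be flagged for a single target. It is an illustration under stated assumptions, not the annotation protocol used for CPI2M: the function names, the set-based fingerprints, and both thresholds (Tanimoto similarity of at least 0.9 and a potency gap of at least one log unit, i.e. 10-fold) are hypothetical choices made for this example.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, sim_threshold=0.9, potency_gap=1.0):
    """Return pairs of compound names that form an activity cliff.

    compounds: list of (name, fingerprint, potency) tuples, where potency is
    a negative-log value such as pKi or pIC50 measured against one target.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    cliffs = []
    for (name_a, fp_a, p_a), (name_b, fp_b, p_b) in combinations(compounds, 2):
        similar = tanimoto(fp_a, fp_b) >= sim_threshold
        large_gap = abs(p_a - p_b) >= potency_gap
        if similar and large_gap:
            cliffs.append((name_a, name_b))
    return cliffs


# Toy data: hypothetical fingerprints and potencies for one target.
toy = [
    ("cpd_1", frozenset(range(0, 20)), 8.2),
    ("cpd_2", frozenset(range(0, 19)), 6.1),   # near-identical to cpd_1, much weaker
    ("cpd_3", frozenset(range(0, 20)), 8.0),   # identical bits to cpd_1, similar potency
    ("cpd_4", frozenset(range(50, 70)), 5.0),  # structurally unrelated
]

print(find_activity_cliffs(toy))
```

In this toy set, `cpd_2` forms a cliff with both `cpd_1` and `cpd_3`, while `cpd_4` is too dissimilar to pair with any compound. In practice the fingerprints would come from a cheminformatics toolkit, and pairs should only be compared within the same target and activity type.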
Supplementary materials
Supporting Information