Abstract
Deep learning models trained on chemical string representations such as SMILES, referred to as chemical language models (CLMs), can learn chemical features relevant to molecular characteristics such as bioactivity. For this purpose, CLMs are typically fine-tuned on active molecules to bias them towards a task-specific region of interest in chemical space. Here, we present a way to augment CLM development with inactive molecules by incorporating an activity label for self-supervised learning. We capitalize on this activity information to establish a CLM for bioactivity prediction of drug molecules. In retrospective evaluation, this model demonstrated superior target prediction performance, and prospective application identified multiple novel modulators with innovative features for pharmacologically relevant targets. The model also robustly predicted activity profiles of approved and experimental drugs, and the activity label allowed extraction of structure-prediction relationships as a new opportunity to improve the explainability of CLMs. These results expand the scope of CLMs and corroborate their use for bioactivity prediction.
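To illustrate the idea of attaching an activity label to a SMILES string for self-supervised training, the following is a minimal sketch and not the implementation used in this work. The label tokens `[ACT]` and `[INACT]`, the `<bos>`/`<eos>` markers, the placement of the label at the end of the sequence, and the helper names are all assumptions made for illustration; the abstract does not specify how the label is encoded.

```python
import re

# Hypothetical label and boundary tokens (assumptions, not from the source).
ACTIVE, INACTIVE = "[ACT]", "[INACT]"
BOS, EOS = "<bos>", "<eos>"

# Simple regex-style SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN.findall(smiles)


def labelled_sequence(smiles, active):
    """Build one training sequence: molecule tokens followed by an activity label.

    Placing the label last (an assumption) would let a next-token model
    predict it after reading the whole molecule, so active and inactive
    molecules can both contribute to training.
    """
    label = ACTIVE if active else INACTIVE
    return [BOS] + tokenize(smiles) + [label, EOS]


if __name__ == "__main__":
    print(labelled_sequence("c1ccccc1Br", active=True))
    print(labelled_sequence("CC(=O)O", active=False))
```

Under this assumed scheme, a trained model's probability for `[ACT]` versus `[INACT]` at the label position could serve as an activity prediction for a query molecule.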
Supplementary materials
Supplementary Information: Supplementary Figures & Tables, Supplementary Methods
Supplementary Table 3 (top100_predictions): Top-ranking 100 molecules from the virtual screening application