Bioactivity prediction with chemical language models trained on labeled molecules

Laura Isigkeit; Tim Hörmann; Vittorio Lembo; Johanna Ehrler; Ewgenij Proschak; Francesca Grisoni; Daniel Merk

doi:10.26434/chemrxiv-2025-lwnjs

Theoretical and Computational Chemistry

23 May 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Deep learning models trained on chemical string representations such as SMILES, referred to as chemical language models (CLMs), can learn chemical features relevant to molecular properties such as bioactivity. For this purpose, CLMs are typically fine-tuned on active molecules to bias them towards a region of interest in chemical space. Here, we present a way to augment CLM development with inactive molecules by incorporating an activity label for self-supervised learning. We capitalize on this activity information and establish a CLM for bioactivity prediction of drug molecules. Retrospective evaluation of this model demonstrated superior target prediction performance, and prospective application identified multiple novel modulators with innovative features for pharmacologically relevant targets. The model also robustly predicted activity profiles of approved and experimental drugs, and the activity label allowed extraction of structure-prediction relationships as a new opportunity to improve the explainability of CLMs. These results expand the scope of CLMs and corroborate their use for bioactivity prediction.
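As a rough illustration of the idea of attaching an activity label to molecules for sequence-model training, the sketch below tokenizes a SMILES string and appends a label token. The abstract does not describe the actual labeling scheme, so the token names (`[ACT]`, `[INACT]`), the label position at the end of the sequence, and the simplified tokenizer are all assumptions made for illustration, not the authors' method.

```python
import re

# Hypothetical label tokens: the abstract does not specify how labels are encoded.
ACTIVE_TOKEN = "[ACT]"
INACTIVE_TOKEN = "[INACT]"

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, digits, bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject characters the regex misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def label_sequence(smiles, active):
    """Return the token sequence with an activity label appended (assumed position)."""
    label = ACTIVE_TOKEN if active else INACTIVE_TOKEN
    return tokenize(smiles) + [label]


if __name__ == "__main__":
    # Aspirin, labeled as active purely for demonstration.
    print(label_sequence("CC(=O)Oc1ccccc1C(=O)O", active=True))
```

With the label placed last, a next-token model trained on such sequences could in principle be queried for the label given only the molecule tokens; whether the preprint uses this arrangement is not stated in the abstract.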

Supplementary materials

Supplementary Information: Supplementary Figures & Tables; Supplementary Methods

Supplementary Table 3 (top100_predictions): Top-ranking 100 molecules from the virtual screening application

License

The content is available under CC BY-NC-ND 4.0.

Funding

European Research Council, grant 101040355

Innovative Medicines Initiative, grant 875510

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
