How to generalize machine learning models to both canonical and non-canonical peptides

Raúl Fernández-Díaz; Rodrigo Ochoa; Thanh Lam Hoang; Vanessa Lopez; Denis Shields

doi:10.26434/chemrxiv-2025-ggp8n-v2

Theoretical and Computational Chemistry

12 May 2025, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Bioactive peptides are an important class of natural products with great functional diversity. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for canonical (non-modified) peptides are more abundant than for non-canonical (chemically modified) peptides. We set out to assess the feasibility of generalizing from canonical to non-canonical datasets. To do this, we first considered two critical aspects of the modeling problem: the choice of similarity function for guiding dataset partitioning and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model. We demonstrate, across four peptide function prediction tasks, that chemical fingerprint-based similarity measures outperform traditional sequence alignment-based metrics for partitioning canonical peptide datasets, challenging standard practices. Similarly, we have found that chemical fingerprints are the best option for building canonical-to-canonical and non-canonical-to-non-canonical predictive models. for the more challenging canonical-to-non-canonical and non-canonical-to-canonical extrapolation scenarios. Finally, we discovered that by enriching the canonical datasets with non-canonical peptides, we are able to build robust joint models that generalize adequately to both canonical and non-canonical data. All code and data necessary for reproducing the experiments are available on GitHub (https://github.com/IBM/PeptideGeneralizationBenchmarks).
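The idea of similarity-based partitioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the k-mer feature set is a toy stand-in for a real chemical fingerprint, the function names (`kmer_fingerprint`, `tanimoto`, `similarity_partition`) are hypothetical, and the threshold and test fraction are arbitrary illustrative values. The sketch clusters items whose pairwise Tanimoto similarity reaches the threshold and assigns whole clusters to either train or test, so that no test item is similar to any train item.

```python
from itertools import combinations


def kmer_fingerprint(s: str, k: int = 3) -> frozenset:
    """Toy stand-in for a chemical fingerprint: the set of k-mers in a string."""
    return frozenset(s[i:i + k] for i in range(max(len(s) - k + 1, 1)))


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_partition(items, threshold=0.4, test_fraction=0.25):
    """Return (train_idx, test_idx) with no train/test pair at or above threshold."""
    fps = [kmer_fingerprint(x) for x in items]
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(items)), 2):
        if tanimoto(fps[i], fps[j]) >= threshold:
            parent[find(i)] = find(j)

    # Group indices into connected components (clusters).
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)

    # Move whole clusters into the test set, smallest first,
    # without exceeding the requested test size.
    target = int(test_fraction * len(items))
    test = []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)

    test_set = set(test)
    train = [i for i in range(len(items)) if i not in test_set]
    return train, sorted(test)


# Illustrative sequences only; these are not taken from the paper's datasets.
peptides = [
    "ACDEFGHIK", "ACDEFGHIR", "ACDEFGHLK",
    "WWYYPPQQ", "WWYYPPQN",
    "MNSTVMNST", "KKLLKKLL", "GGSGGSGG",
]
train_idx, test_idx = similarity_partition(peptides)
```

Because train and test are unions of whole connected components, every cross-split pair falls below the threshold by construction. In practice the toy k-mer sets would be replaced by fingerprints computed from the peptides' chemical structures, which is what allows the same procedure to cover non-canonical residues.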

Keywords

AutoML

Peptides

Representation learning

Benchmark

Chemical Language Models

Supplementary weblinks

Code and data

GitHub repository containing the code and data necessary for reproducing the experiments in the paper.

Version History

May 12, 2025 Version 2

Mar 24, 2025 Version 1

Version Notes

Change format of results into tables to make them easier to interpret. Minor improvements to the expression. Additional experiments introduced.

License

The content is available under CC BY NC ND 4.0

Funding

Science Foundation Ireland

18/CRT/6214 to R.F.D.

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
