How to generalize machine learning models to both canonical and non-canonical peptides

24 March 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Bioactive peptides are an important class of natural products with great functional diversity. Chemical modifications can improve their pharmacology, yet their structural diversity poses unique challenges for computational modeling. Furthermore, data for canonical (unmodified) peptides are more abundant than for non-canonical (chemically modified) peptides. We explored whether current methods are sufficient to generalize from canonical data to non-canonical datasets. To do so, we first considered two critical aspects of the modeling problem: the choice of similarity function used to guide dataset partitioning, and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides a dataset into train and test subsets such that the molecules in the test set differ from those used to fit the model. We demonstrate, across four peptide function prediction tasks, that chemical fingerprint-based similarity measures outperform traditional sequence alignment-based metrics for partitioning canonical peptide datasets, challenging standard practice. We also found that deep-learned embeddings from Chemical Language Models (CLMs) generally outperform chemical fingerprints and other peptide-specific pre-trained models, performing best for non-canonical peptides and second best for canonical peptides. Despite this, models trained on only one of the two peptide classes fail to extrapolate properly to the other. However, by enriching the canonical datasets with a small proportion of non-canonical peptides, we are able to build robust joint models that generalize adequately to both canonical and non-canonical data. All code and data necessary for reproducing the experiments are available on GitHub (https://github.com/IBM/AutoPeptideML/tree/peptide-rep-gen).
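To make the idea of similarity-based dataset partitioning concrete, the following is a minimal sketch (not the paper's actual implementation) of a greedy split driven by Tanimoto similarity on binary fingerprints, here represented as sets of on-bit indices. The `threshold` value and function names are illustrative assumptions, not taken from the paper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints, each given as a set
    of on-bit indices: |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def similarity_partition(fingerprints, threshold=0.4):
    """Greedy train/test split: a molecule is assigned to the test set only
    if its maximum Tanimoto similarity to every molecule already in the
    training set is below `threshold`; otherwise it joins the training set.
    Returns (train_indices, test_indices)."""
    train, test = [], []
    for idx, fp in enumerate(fingerprints):
        if train and all(tanimoto(fp, fingerprints[t]) < threshold for t in train):
            test.append(idx)
        else:
            train.append(idx)
    return train, test

# Toy example: the first two fingerprints are similar, the third is not,
# so the third is held out as an out-of-distribution test molecule.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
train_idx, test_idx = similarity_partition(fps, threshold=0.4)
# train_idx == [0, 1], test_idx == [2]
```

In practice the fingerprints would come from a cheminformatics toolkit (e.g. Morgan/ECFP fingerprints), and the same scheme could be run with a sequence alignment score in place of `tanimoto` for comparison, which is the contrast the abstract describes.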

Keywords

AutoML
Peptides
Representation learning
Benchmark
Chemical Language Models
