How to generalize machine learning models to both canonical and non-canonical peptides

12 May 2025, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Bioactive peptides are an important class of natural products with great functional diversity. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for canonical (non-modified) peptides are more abundant than for non-canonical (chemically modified) peptides. We set out to assess the feasibility of generalizing from canonical to non-canonical datasets. To do this, we first considered two critical aspects of the modeling problem: the choice of similarity function for guiding dataset partitioning and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets such that the molecules in the test set are dissimilar to those used to fit the model. We demonstrate, across four peptide function prediction tasks, that chemical fingerprint-based similarity measures outperform traditional sequence alignment-based metrics for partitioning canonical peptide datasets, challenging standard practices. Similarly, we have found that chemical fingerprints are the best option for building canonical-to-canonical and non-canonical-to-non-canonical predictive models. [...] for the more challenging canonical-to-non-canonical and non-canonical-to-canonical extrapolation scenarios. Finally, we discovered that by enriching the canonical datasets with non-canonical peptides, we are able to build robust joint models that generalize adequately to both canonical and non-canonical data. All code and data necessary for reproducing the experiments are available on GitHub (https://github.com/IBM/PeptideGeneralizationBenchmarks).
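To make the idea of similarity-based dataset partitioning concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes each peptide is already encoded as a set of "on" fingerprint bits (toy integers here, standing in for a real chemical fingerprint), groups peptides whose Tanimoto similarity exceeds a threshold using single-linkage clustering, and assigns whole clusters to the test set. The names `tanimoto` and `similarity_split`, the threshold, and the clustering strategy are all illustrative assumptions.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_split(fingerprints, threshold=0.4, test_fraction=0.2):
    """Return (train_idx, test_idx) so that no train/test pair exceeds `threshold`.

    Hypothetical helper: single-linkage clustering via union-find, then whole
    clusters are moved to the test set until the size budget is reached.
    """
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity exceeds the threshold.
    for i, j in combinations(range(n), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set with whole clusters, smallest first, within the budget.
    budget = int(round(test_fraction * n))
    test = []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= budget:
            test.extend(members)

    test_set = set(test)
    train = [i for i in range(n) if i not in test_set]
    return train, sorted(test)


# Toy fingerprints (assumed data): two pairs of near-duplicates and one outlier.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
train_idx, test_idx = similarity_split(fps, threshold=0.4, test_fraction=0.2)
```

Because similar peptides always share a cluster, and clusters are never divided between subsets, every test peptide is guaranteed to fall below the similarity threshold with respect to every training peptide. In practice the toy bit sets would be replaced by fingerprints computed with a cheminformatics toolkit.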

Keywords

AutoML
Peptides
Representation learning
Benchmark
Chemical Language Models
