Abstract
Bioactive peptides are an important class of natural products with great functional diversity. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for canonical (non-modified) peptides are more abundant than for non-canonical (chemically modified) ones. We set out to assess the feasibility of generalizing from canonical to non-canonical datasets. To do this, we first considered two critical aspects of the modeling problem: the choice of similarity function for guiding dataset partitioning and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets such that the molecules in the test set are dissimilar to those used to fit the model. We demonstrate, across four peptide function prediction tasks, that chemical fingerprint-based similarity measures outperform traditional sequence alignment-based metrics for partitioning canonical peptide datasets, challenging standard practices. Similarly, we found that chemical fingerprints are the best option for building canonical-to-canonical and non-canonical-to-non-canonical predictive models. [...] for the more challenging canonical-to-non-canonical and non-canonical-to-canonical extrapolation scenarios. Finally, we discovered that by enriching the canonical datasets with non-canonical peptides, we are able to build robust joint models that generalize adequately to both canonical and non-canonical data. All code and data necessary for reproducing the experiments are available on GitHub (https://github.com/IBM/PeptideGeneralizationBenchmarks).
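As an illustration of similarity-based dataset partitioning, the following is a minimal sketch and not the procedure implemented in the repository. It assumes each peptide is already encoded as a set of "on" fingerprint bits, uses Tanimoto similarity, and groups molecules by single-linkage clustering so that whole clusters go to either train or test. The function names, the toy fingerprints, the threshold of 0.5, and the 20% test fraction are all hypothetical choices made for the example.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_partition(fingerprints, threshold=0.5, test_fraction=0.2):
    """Split indices into (train, test) so that no test item has a
    similarity >= threshold to any train item.

    Hypothetical helper: items are linked when their similarity reaches the
    threshold, linked items are merged into clusters (union-find), and whole
    clusters are assigned to the test set, smallest first.
    """
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        # Find the cluster root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair that is at least `threshold` similar.
    for i, j in combinations(range(n), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set with whole clusters without exceeding the target size.
    target = int(round(test_fraction * n))
    test = []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)

    test_set = set(test)
    train = [i for i in range(n) if i not in test_set]
    return train, sorted(test)


# Toy fingerprints (made-up bit sets, not real peptide data).
fps = [
    frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({2, 3, 4, 5}),
    frozenset({10, 11, 12}), frozenset({10, 11, 13}), frozenset({20, 21}),
    frozenset({30, 31, 32}), frozenset({40, 41}), frozenset({1, 2, 3, 4, 5}),
    frozenset({50, 51, 52}),
]
train, test = similarity_partition(fps, threshold=0.5, test_fraction=0.2)
print("train:", train)
print("test:", test)
```

Because clusters are never split, every test molecule is guaranteed to fall below the threshold with respect to all training molecules. The same skeleton would accept a sequence alignment-based similarity in place of `tanimoto`, which is the kind of comparison the abstract describes.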
Supplementary weblinks
Code and data: GitHub repository containing the code and data necessary for reproducing the experiments in the paper.