GPepT: A foundation language model for peptidomimetics incorporating non-canonical amino acids

08 April 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Language models have become increasingly popular in therapeutic peptide generation, but the molecular diversity of their outputs remains limited by reliance on the 20 canonical amino acids. We propose a language model that generates peptidomimetics incorporating non-canonical elements such as non-canonical amino acids and terminal modifications. To accomplish this, we created a vocabulary of over 17,000 non-canonical elements by extracting them from chemical formulas stored in the ChEMBL database. Our pretrained language model, GPepT, showed improved diversity in molecular structures and chemical properties. To demonstrate its real-world application, we fine-tuned the model for antimicrobial peptides. Experimental validation revealed that one of the generated peptidomimetics exhibited effective antimicrobial activity, marking a successful case of AI-driven peptide development. GPepT is fully accessible on HuggingFace: https://huggingface.co/Playingyoyo/GPepT.
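
The vocabulary idea in the abstract can be illustrated with a toy sketch. It is not the Monomerizer algorithm (described in Section S1 of the Supporting Information) and does not use GPepT's actual tokenizer. It assumes peptidomimetics have already been split into monomer tokens, and the labels "Ac-", "Orn", "Aib" and "-NH2" are hypothetical names chosen here for illustration. Canonical residues receive fixed ids, and every non-canonical element observed in the data is appended to the vocabulary.

```python
from collections import Counter

# One-letter codes of the 20 canonical amino acids.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def build_vocab(sequences, min_count=1):
    """Assign token ids to canonical residues and observed non-canonical elements."""
    # Count every token that is not a canonical one-letter residue.
    counts = Counter(
        token
        for seq in sequences
        for token in seq
        if token not in CANONICAL
    )
    vocab = {"<pad>": 0, "<unk>": 1}
    # Canonical residues come first, in alphabetical order.
    for aa in sorted(CANONICAL):
        vocab[aa] = len(vocab)
    # Non-canonical elements follow: most frequent first, ties broken alphabetically.
    for token, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(seq, vocab):
    """Convert a token sequence to ids, mapping unseen tokens to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in seq]


# Hypothetical, pre-tokenized peptidomimetics (illustrative labels only).
sequences = [
    ["Ac-", "K", "Orn", "L", "-NH2"],
    ["G", "Orn", "Aib", "W", "-NH2"],
]

vocab = build_vocab(sequences)
ids = encode(sequences[0], vocab)
print(len(vocab), ids)
```

Raising `min_count` would drop rare non-canonical elements to `<unk>`, a common way to trade coverage for a smaller vocabulary; whether GPepT applies such a cutoff is not stated in the abstract.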

Keywords

Monomerizer
GPepT
GPT
Protein
Peptide
Peptidomimetics
RDKit
SMILES
SMILES2Seq
Non-canonical amino acids
Amino acids

Supplementary materials

Supporting Information
Section S1: Algorithmic details of Monomerizer.
Figure S1: Comparison of non-canonical amino acids (ncAAs), terminal modifications, and canonical amino acids (cAAs) mined from ChEMBL. (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.
Figure S2: Comparison of peptidomimetics and peptides mined from ChEMBL (Dataset P). (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.
Table S1: Valid peptidomimetics chosen for antimicrobial activity test.
