Abstract
Language models have become increasingly popular for therapeutic peptide generation, but molecular diversity remains limited by their reliance on the 20 canonical amino acids. We propose a language model that generates peptidomimetics incorporating non-canonical elements such as non-canonical amino acids and terminal modifications. To accomplish this, we created a vocabulary of over 17,000 non-canonical elements by extracting them from chemical formulas stored in the ChEMBL database. Our pretrained language model, GPepT, showed improved diversity in molecular structures and chemical properties. To demonstrate a real-world application, we fine-tuned the model for antimicrobial peptides. Experimental validation revealed that one of the generated peptidomimetics exhibited effective antimicrobial activity, marking a successful case of AI-driven peptide development. GPepT is fully accessible on HuggingFace: https://huggingface.co/Playingyoyo/GPepT.
Supplementary materials
Title
Supporting Information
Description
Section S1: Algorithmic details of Monomerizer.
Figure S1: Comparison of non-canonical amino acids (ncAAs), terminal modifications, and canonical amino acids (cAAs) mined from ChEMBL. (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.
Figure S2: Comparison of peptidomimetics and peptides mined from ChEMBL (Dataset P). (a) t-SNE visualization of Morgan fingerprints. (b) Distribution of physicochemical properties.
Table S1: Valid peptidomimetics chosen for antimicrobial activity test.
Supplementary weblinks
Title
Monomerizer (Github)
Description
Monomerizer (also called SMILES2Seq or SMILES2FASTA) is a pipeline that converts peptides and peptidomimetics, represented as SMILES strings, into sequences of amino acids and terminal modifications.
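The final step of such a conversion, turning monomers into sequence tokens, might look roughly like the sketch below. This is not Monomerizer's actual code: it assumes the molecule has already been split into free-monomer SMILES, it matches them by exact string comparison rather than proper SMILES canonicalization, and the function name `monomers_to_sequence`, the three-entry lookup table, and the `X0`, `X1`, ... token format are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Toy lookup table (assumption): non-stereo SMILES of free amino acids
# mapped to their one-letter codes. Only three entries, for illustration.
CANONICAL: Dict[str, str] = {
    "NCC(=O)O": "G",      # glycine
    "CC(N)C(=O)O": "A",   # alanine
    "NC(CO)C(=O)O": "S",  # serine
}


def monomers_to_sequence(monomers: List[str], vocab: Dict[str, str]) -> List[str]:
    """Map monomer SMILES to tokens.

    Canonical monomers get their one-letter code. Any other monomer gets a
    placeholder token (hypothetical format "X<n>") that is recorded in
    `vocab`, so the same structure always maps to the same token.
    """
    tokens: List[str] = []
    for smiles in monomers:
        if smiles in CANONICAL:
            tokens.append(CANONICAL[smiles])
        else:
            if smiles not in vocab:
                vocab[smiles] = f"X{len(vocab)}"
            tokens.append(vocab[smiles])
    return tokens


# Example: Gly, 2-aminoisobutyric acid (non-canonical), Ala, and the same
# non-canonical monomer again.
vocab: Dict[str, str] = {}
seq = monomers_to_sequence(
    ["NCC(=O)O", "CC(C)(N)C(=O)O", "CC(N)C(=O)O", "CC(C)(N)C(=O)O"],
    vocab,
)
print(seq)
print(vocab)
```

Reusing one growing `vocab` dictionary across many molecules is what would let a vocabulary of non-canonical elements accumulate over a whole dataset. A real pipeline would need a cheminformatics toolkit to fragment the molecule and normalize each monomer before lookup.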
Title
GPepT
Description
GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for de novo protein design and engineering. As demonstrated in our research, the incorporation of peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.