Abstract
Antimicrobial peptides (AMPs) have gained significant attention in drug discovery owing to their therapeutic potential in the fight against antimicrobial resistance. Rational design of AMPs is notoriously difficult because of the vast space of possible peptide sequences and their complex structure-activity landscape, making the problem well suited to machine-learning models, which can be trained on available data to predict new sequences with a desired activity profile. Here we investigated the performance of large language models (LLMs) fine-tuned on data from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP) to predict the antimicrobial activity and hemolysis of AMPs from their amino acid sequences. We show that GPT-3-based models perform slightly better than previously reported recurrent neural networks (RNNs) and related architectures on comparable datasets. Furthermore, GPT-3-based models perform remarkably well in the low-data regime. Advantages in terms of training time and cost are also discussed.
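Fine-tuning GPT-3 for a classification task of this kind requires converting the annotated sequences into prompt/completion pairs in JSONL format. The sketch below illustrates that conversion for hypothetical DBAASP-style records; the example sequences, labels, and prompt separator are assumptions for illustration, not the authors' actual data-preparation pipeline (which is available in their GitHub repository).

```python
import json

# Hypothetical records in the spirit of DBAASP entries: a peptide
# sequence annotated with an activity label (labels invented here).
records = [
    {"sequence": "GIGKFLHSAKKFGKAFVGEIMNS", "active": True},
    {"sequence": "AAAAAAAAAA", "active": False},
]

def to_finetune_example(record):
    """Convert one annotated sequence into an OpenAI-style
    prompt/completion pair for classification fine-tuning."""
    # A fixed separator marks the end of the prompt; a leading space
    # in the completion is the convention for GPT-3 fine-tuning.
    prompt = record["sequence"] + "\n\n###\n\n"
    completion = " active" if record["active"] else " inactive"
    return {"prompt": prompt, "completion": completion}

# Each line of the resulting JSONL file is one training example.
lines = [json.dumps(to_finetune_example(r)) for r in records]
print("\n".join(lines))
```

The resulting JSONL file can then be uploaded to the fine-tuning API; at inference time, the model completes an unseen sequence's prompt with the predicted label.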
Supplementary weblinks
GitHub repository: all training data (peptide sequences annotated with activities) and code to access the models, allowing the results to be reproduced.