Machine Learning Modeling and Insights into the Structural Foundations of Polymyxin-like Antimicrobials


Antimicrobial resistance (AMR) is a silent pandemic and an urgent threat to human health. The antibiotic development pipeline remains slow while AMR escalates rapidly, particularly among Gram-negative pathogens. Polymyxins, long out of use because of their toxic side effects, have been revived as a last-line therapeutic option because other antibiotics are failing. To reduce their toxicity and improve their antimicrobial activity, many studies have generated polymyxin analogues through different, mostly empirical, strategies. Faster and more reliable approaches are therefore still needed to make analogue design efficient enough to tackle AMR in a timely fashion. In silico approaches such as machine learning, which are faster and more cost-efficient, are likely to accelerate the discovery of new drugs. In this work, machine learning was applied to Quantitative Structure-Activity Relationship (QSAR) modeling with the objective of providing a working semi-quantitative model capable of predicting the activity of polymyxin-like molecules against a given species. To this end, we applied four learning algorithms and ten families of molecular descriptors to our dataset of 408 molecule/microorganism pairs retrieved from PubChem. The AdaBoost model built with the CKP set of descriptors performed best, with good accuracy and very low rates of false negative and false positive predictions. Preliminary exploration of the model's response to systematic changes in the structure of polymyxin B reveals a trend towards increased antimicrobial activity when some of its constituent amino acids are exchanged for more lipophilic ones. Experimental studies based on this model are already underway, and we believe it will become a crucial tool for drug development.
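
As an illustration of the kind of classifier summarized above, the following is a minimal, self-contained Python sketch of AdaBoost with decision stumps, together with the accuracy and false positive/negative counts that can be used to judge such a model. It is not the model from this work: the two descriptor columns, the grid of toy data, the labeling rule, and all function names are hypothetical and chosen only for illustration.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best one-feature rule."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Rule: predict `sign` when the feature exceeds t, otherwise `-sign`.
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if (sign if row[j] > t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] > t else -sign


def adaboost_fit(X, y, rounds=10):
    """Train AdaBoost on labels in {-1, +1}; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        # Clamp the error so the logarithm stays finite for a perfect stump.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, fp, tn, fn


# Hypothetical toy data (NOT from the paper): two made-up descriptors on a
# 5x5 grid, labeled "active" (+1) when their sum exceeds 1.
X = [[a / 4, b / 4] for a in range(5) for b in range(5)]
y = [1 if a + b > 4 else -1 for a in range(5) for b in range(5)]

model = adaboost_fit(X, y, rounds=20)
pred = [adaboost_predict(model, row) for row in X]
tp, fp, tn, fn = confusion_counts(y, pred)
print("training accuracy:", (tp + tn) / len(y))
print("false positives:", fp, "false negatives:", fn)
```

The sketch reports training-set figures only; a real QSAR evaluation would score the model on held-out molecule/microorganism pairs.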


Supplementary material

Electronic Supporting Information
Software code for using the final model, scores of all tested ML models, optimized hyper-parameters for all random forest and AdaBoost models, and partial dependence plots for the features with less than 10% PI
Collected data set