These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or reported in news media as established information.

SoluProt: Prediction of Soluble Protein Expression in Escherichia coli

submitted on 03.10.2020, 11:10 and posted on 05.10.2020, 12:17 by Jiri Hon, Martin Marusiak, Tomas Martinek, Antonin Kunka, Jaroslav Zendulka, David Bednar, Jiri Damborsky

Motivation: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritisation of highly soluble proteins.

Results: A new tool for sequence-based prediction of soluble protein expression in Escherichia coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set. When evaluated against a balanced independent test set derived from the NESG database, SoluProt’s accuracy of 58.4% and AUC of 0.60 exceeded those of a suite of alternative solubility prediction tools. There is also evidence that it could significantly increase the success rate of experimental protein studies. SoluProt is freely available as a standalone program and a user-friendly webserver at

Availability and Implementation:


Supplementary Information: Supplementary data are available at Bioinformatics online.
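Illustration of the reported metrics: the Results quote accuracy and AUC on a balanced test set. The sketch below shows how these two quantities are commonly computed from predicted solubility scores. It is not the authors' evaluation code; the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration and have no connection to the TargetTrack or NESG data.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded score matches the label.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen soluble example (label 1)
    receives a higher score than a randomly chosen insoluble one (label 0),
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical balanced toy set: 4 soluble (1) and 4 insoluble (0) proteins.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.8, 0.3, 0.5, 0.2]

print(accuracy(toy_labels, toy_scores))  # 0.625
print(auc(toy_labels, toy_scores))       # 0.75
```

On a balanced set, a predictor that guesses at random is expected to score about 50% accuracy and an AUC of about 0.5, which is the baseline against which the reported figures can be read.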


Masaryk University


Czech Republic

Declaration of Conflict of Interest

No conflict of interest.