Protein pKa prediction by tree-based machine learning

14 December 2021, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three datasets of experimental PDB structures and pKa values, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69 pKa units, with RMSE values of 0.56 and 0.78 for surface and buried residues, respectively, across six residue types (Asp, Glu, His, Lys, Cys and Tyr), and 0.63 when only Asp, Glu, His and Lys are considered. We also provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observed that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to physiological pH.
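
The error metrics quoted in the abstract (an overall RMSE plus separate surface and buried values, and a relative comparison against a baseline) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function names, the record layout and the toy numbers are hypothetical and are not taken from the preprint, and the abstract does not state how the 37% figure was computed, so the relative RMSE reduction shown here is just one plausible reading.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root-mean-square error over (predicted, experimental) pKa pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse needs at least one (predicted, experimental) pair")
    return math.sqrt(sum((pred - exp) ** 2 for pred, exp in pairs) / len(pairs))


def rmse_by_group(records):
    """Per-group and overall RMSE.

    records: iterable of (group, predicted_pKa, experimental_pKa), where
    group is a hypothetical label such as "surface" or "buried".
    """
    records = list(records)
    groups = defaultdict(list)
    for group, pred, exp in records:
        groups[group].append((pred, exp))
    result = {group: rmse(pairs) for group, pairs in groups.items()}
    result["overall"] = rmse((pred, exp) for _, pred, exp in records)
    return result


def relative_improvement(rmse_model, rmse_baseline):
    """Fractional RMSE reduction relative to a baseline (assumed definition)."""
    return (rmse_baseline - rmse_model) / rmse_baseline


# Toy, made-up values; NOT data from the preprint.
toy_records = [
    ("surface", 4.1, 3.9),
    ("surface", 4.5, 4.4),
    ("buried", 6.0, 7.0),
    ("buried", 10.2, 9.4),
]

if __name__ == "__main__":
    for label, value in rmse_by_group(toy_records).items():
        print(f"{label}: {value:.2f}")
```

Because the overall RMSE pools squared errors over all residues, it is not the simple average of the surface and buried values; it depends on how many residues fall in each group.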

Keywords

machine learning
Extra Trees
XGBoost
LightGBM
protein pKa prediction
pKa
AlphaFold Database
Random Forest

Supplementary materials

Title: Supplementary Information for Protein pKa prediction by tree-based machine learning
Description: Hyperparameters tuned and their ranges; distribution of pKa values in the training sets; complete feature importance ranking; distribution of features for proteins in the human proteome from the AlphaFold Protein Structure Database.

Title: pKa predictions for AlphaFold structures
Description: Predicted pKa values for proteins in the human proteome from the AlphaFold Protein Structure Database.
