Protein pKa prediction by tree-based machine learning

Ada Y. Chen; Juyong Lee; Ana Damjanovic; Bernard R. Brooks

doi:10.26434/chemrxiv-2021-4d420

Biological and Medicinal Chemistry

14 December 2021, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69 across six residue types (Asp, Glu, His, Lys, Cys and Tyr), with RMSE values of 0.56 for surface residues and 0.78 for buried residues; the RMSE is 0.63 when only Asp, Glu, His and Lys are considered. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
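The abstract reports an overall RMSE, separate surface and buried RMSE values, and a percentage improvement over PROPKA. The following is a minimal sketch of how such figures could be computed from paired predicted and experimental pKa values. The function names, the record layout and the toy data are hypothetical, and reading "performs 37% better" as a relative reduction in RMSE is an assumption, not something the abstract states.

```python
import math
from collections import defaultdict


def rmse(errors):
    """Root-mean-square of (predicted - experimental) pKa errors."""
    errors = list(errors)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_by_group(records):
    """Return the overall RMSE plus one RMSE per group label.

    `records` is an iterable of (group, predicted_pka, experimental_pka),
    where `group` could be a location such as "surface" or "buried".
    """
    by_group = defaultdict(list)
    for group, predicted, experimental in records:
        by_group[group].append(predicted - experimental)
    all_errors = [e for errs in by_group.values() for e in errs]
    result = {"overall": rmse(all_errors)}
    for group, errs in by_group.items():
        result[group] = rmse(errs)
    return result


def relative_improvement(model_rmse, baseline_rmse):
    """Percent reduction in RMSE relative to a baseline predictor.

    Assumption: "X% better" is read as 100 * (baseline - model) / baseline.
    """
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical toy data, not taken from the paper:
# (location, predicted pKa, experimental pKa)
toy = [
    ("surface", 4.1, 3.9),
    ("surface", 10.2, 10.6),
    ("buried", 6.0, 7.0),
    ("buried", 5.5, 5.0),
]
scores = rmse_by_group(toy)
```

Grouping the errors before taking the square root matters: the overall RMSE is computed from the pooled squared errors, so it is not simply the average of the surface and buried values.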

Keywords

protein pKa prediction

pKa

AlphaFold Database

Random Forest

Supplementary materials

Title: Supplementary Information for Protein pKa prediction by tree-based machine learning

Description: Hyperparameters tuned and their ranges; distribution of pKa values in the training sets; complete feature importance ranking; distribution of features for proteins in the human proteome from the AlphaFold Protein Structure Database

Title: pKa predictions for AlphaFold structures

Description: Predicted pKa values for proteins in the human proteome from the AlphaFold Protein Structure Database

Now Published

Protein pKa Prediction by Tree-Based Machine Learning

Ada Y. Chen, Juyong Lee, Ana Damjanovic, Bernard R. Brooks

Journal article, Journal of Chemical Theory and Computation, Volume 18, Issue 4

Online publication date: Mar 15, 2022

Version History

Dec 14, 2021 Version 1

License

The content is available under CC BY NC 4.0

Funding

National Heart, Lung, and Blood Institute: 75N92019P00048

National Heart, Lung, and Blood Institute: 75N92020P00042

National Institutes of Health: ZIA HL001051

National Research Foundation of Korea: 2019M3E5D4066897

National Research Foundation of Korea: 2019M3E5D4066898

National Research Foundation of Korea: 2018R1C1B600543513

National Research Foundation of Korea: 2020R1F1A1075998

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
