An Efficient Machine Learning-Based Prediction Model for JAK2 Inhibitor pIC50

Shengyao Liang

doi:10.26434/chemrxiv-2025-3v3gw

Biological and Medicinal Chemistry

28 April 2025, Version 1

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

**Background:** Janus Kinase 2 (JAK2) is a key kinase in cellular signal transduction. Its abnormal activation is closely related to various myeloproliferative neoplasms and inflammatory diseases. Developing selective JAK2 inhibitors is an important direction in drug discovery. Accurate prediction of compound inhibitory activity (pIC50) against JAK2 is crucial for accelerating the discovery and optimization of lead compounds.

**Objective:** This study aims to utilize public resources from the ChEMBL database, combined with machine learning methods, to build a computational model capable of efficiently and accurately predicting the pIC50 values of JAK2 inhibitors.

**Methods:** We collected compounds targeting human JAK2 (ChEMBL ID: CHEMBL2971) and their IC50 (nM) activity data from the ChEMBL database. After data cleaning (retaining only precise values with `standard_relation = '='`) and standardization (converting IC50 to pIC50, retaining the average pIC50 for duplicate compounds), a dataset containing 5546 compounds was finally obtained. RDKit (version 2022.9.5) was used to calculate Morgan fingerprints (radius=2, 2048 bits), MACCS Keys fingerprints (167 bits), and 13 physicochemical and topological descriptors. Based on feature importance calculated during the data processing phase (derived from preliminary model evaluation), the top 350 features were selected. However, due to the absence of some features in the current dataset, the final model used **345** features. The dataset was randomly split into training (n=4436) and test sets (n=1110) at an 80:20 ratio. The XGBoost (eXtreme Gradient Boosting, version 3.0.0) algorithm was used to build the prediction model, and hyperparameters (`learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`) were optimized using 5-fold cross-validation and GridSearchCV. An early stopping strategy was employed during the final model training to prevent overfitting.
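The cleaning and standardization step described in the Methods can be illustrated with a minimal sketch. It assumes the usual definition pIC50 = -log10(IC50 in mol/L), which for IC50 values in nM becomes 9 - log10(IC50). The function names, the tuple layout of the records, and the compound identifiers (`CPD-A` and so on) are hypothetical and are not taken from the preprint's code.

```python
import math
from collections import defaultdict


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50, assuming pIC50 = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


def aggregate_pic50(records):
    """Keep only exact measurements and average pIC50 per compound.

    `records` is an iterable of (compound_id, standard_relation, ic50_nm)
    tuples; this layout is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for compound_id, relation, ic50_nm in records:
        if relation != "=":  # drop censored values such as '>' or '<'
            continue
        grouped[compound_id].append(ic50_nm_to_pic50(ic50_nm))
    # Average in pIC50 space, as the Methods describe for duplicate compounds.
    return {cid: sum(vals) / len(vals) for cid, vals in grouped.items()}


# Hypothetical records, not data from the study.
records = [
    ("CPD-A", "=", 10.0),     # pIC50 = 8
    ("CPD-A", "=", 1000.0),   # pIC50 = 6, averaged with the row above
    ("CPD-B", ">", 10000.0),  # censored, removed by the filter
    ("CPD-C", "=", 1.0),      # pIC50 = 9
]
pic50_by_compound = aggregate_pic50(records)
```

Averaging is done on the pIC50 values rather than on the raw IC50 values, which matches the wording "retaining the average pIC50 for duplicate compounds"; averaging raw concentrations first would give a different result.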
**Results:** After hyperparameter optimization, the final XGBoost model demonstrated good predictive performance on the independent test set, achieving a coefficient of determination (R²) of 0.7184, a root mean square error (RMSE) of 0.5968, and a mean absolute error (MAE) of 0.4593. Performance metrics on the training set (R²=0.8978) also indicated a good model fit. The gap between training and test set performance (about 0.18 in R²) was within an acceptable range, suggesting that overfitting was effectively controlled.

**Conclusion:** This study successfully constructed an XGBoost-based prediction model for JAK2 inhibitor pIC50. Utilizing easily accessible molecular descriptors, the model demonstrated high prediction accuracy and robustness on a held-out test set. This model holds promise as an efficient virtual screening tool to aid the early discovery and optimization process of JAK2 inhibitors.
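For readers who want to reproduce the three reported metrics, the sketch below computes R², RMSE, and MAE from their standard definitions using only the Python standard library. The function name and the toy pIC50 values are hypothetical; they illustrate the formulas and are not the study's predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired true and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy values chosen for easy hand-checking, not results from the preprint.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
metrics = regression_metrics(y_true, y_pred)
```

Because RMSE and MAE are expressed in pIC50 units, the reported test-set RMSE of 0.5968 corresponds to a typical error of roughly 0.6 log units in potency.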

Keywords

Quantitative Structure-Activity Relationship (QSAR)

Drug Discovery

Version History

May 06, 2025 Version 3

May 02, 2025 Version 2

Apr 28, 2025 Version 1

Metrics

Views: 509

Downloads: 131

License

The content is available under CC BY 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
