An Efficient Machine Learning-Based Prediction Model for JAK2 Inhibitor pIC50

28 April 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

**Background:** Janus kinase 2 (JAK2) is a key kinase in cellular signal transduction. Its abnormal activation is closely associated with various myeloproliferative neoplasms and inflammatory diseases, so developing selective JAK2 inhibitors is an important direction in drug discovery. Accurate prediction of the inhibitory activity (pIC50) of compounds against JAK2 is crucial for accelerating the discovery and optimization of lead compounds.

**Objective:** This study aims to use public data from the ChEMBL database, combined with machine learning methods, to build a computational model that predicts the pIC50 values of JAK2 inhibitors efficiently and accurately.

**Methods:** We collected compounds targeting human JAK2 (ChEMBL ID: CHEMBL2971) and their IC50 (nM) activity data from the ChEMBL database. Data cleaning retained only exact measurements (`standard_relation = '='`); standardization converted IC50 to pIC50 and, for compounds with duplicate measurements, kept the mean pIC50. This yielded a final dataset of 5546 compounds. RDKit (version 2022.9.5) was used to calculate Morgan fingerprints (radius=2, 2048 bits), MACCS Keys fingerprints (167 bits), and 13 physicochemical and topological descriptors. The top 350 features were selected by feature importance, which was calculated during the data processing phase from a preliminary model evaluation. Because five of these features were absent from the current dataset, the final model used **345** features. The dataset was randomly split into a training set (n=4436) and a test set (n=1110) at an 80:20 ratio. The prediction model was built with the XGBoost (eXtreme Gradient Boosting, version 3.0.0) algorithm, and its hyperparameters (`learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`) were optimized by grid search (GridSearchCV) with 5-fold cross-validation. Early stopping was used during final model training to prevent overfitting.

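
The cleaning and standardization steps described in the Methods can be illustrated with a minimal sketch. It assumes the standard definition pIC50 = -log10(IC50 in mol/L), which for IC50 values in nM equals 9 - log10(IC50). The function names, the record layout, and the compound identifiers are hypothetical and are not taken from the study's code.

```python
import math
from collections import defaultdict


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50: -log10(IC50 [M]) = 9 - log10(IC50 [nM])."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    return 9.0 - math.log10(ic50_nm)


def mean_pic50_per_compound(records):
    """Keep exact measurements only and average pIC50 over duplicates.

    `records` is an iterable of (compound_id, standard_relation, ic50_nm)
    tuples; this layout is an assumption made for illustration.
    """
    values = defaultdict(list)
    for compound_id, relation, ic50_nm in records:
        if relation != "=":  # drop censored values such as '>' or '<'
            continue
        values[compound_id].append(ic50_nm_to_pic50(ic50_nm))
    # One mean pIC50 per compound
    return {cid: sum(v) / len(v) for cid, v in values.items()}


# Toy records with hypothetical identifiers (not real ChEMBL data)
records = [
    ("CPD-A", "=", 10.0),     # pIC50 = 8
    ("CPD-A", "=", 1000.0),   # pIC50 = 6, duplicate of CPD-A
    ("CPD-B", ">", 10000.0),  # censored, removed by the filter
    ("CPD-C", "=", 1.0),      # pIC50 = 9
]
cleaned = mean_pic50_per_compound(records)
```

Note that this sketch averages in pIC50 space, as the abstract describes, rather than averaging raw IC50 values before conversion; the two choices generally give different results.
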
**Results:** After hyperparameter optimization, the final XGBoost model showed good predictive performance on the independent test set, with a coefficient of determination (R²) of 0.7184, a root mean square error (RMSE) of 0.5968 pIC50 units, and a mean absolute error (MAE) of 0.4593 pIC50 units. The training-set performance (R²=0.8978) also indicated a good model fit. The gap between training and test R² (about 0.18) was within an acceptable range, suggesting that overfitting was effectively controlled.

**Conclusion:** This study constructed an XGBoost-based prediction model for JAK2 inhibitor pIC50. Using easily accessible molecular descriptors, the model achieved good prediction accuracy and robustness on a held-out test set. It holds promise as an efficient virtual screening tool to support the early discovery and optimization of JAK2 inhibitors.
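
For readers who want to reproduce the three reported metrics, the following sketch shows their usual definitions in plain Python. The abstract does not state how the metrics were computed; the assumption here is the common convention R² = 1 - SS_res/SS_tot. The function name and the toy values are illustrative only and are not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy pIC50 values for illustration (not data from the study)
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 7.0, 7.5, 9.0]
metrics = regression_metrics(observed, predicted)
```

Because RMSE and MAE are expressed in pIC50 units, an MAE of roughly 0.46 corresponds to a typical error of about a factor of three in IC50.
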

Keywords

JAK2
pIC50
Machine Learning
XGBoost
Quantitative Structure-Activity Relationship (QSAR)
Drug Discovery
