Abstract
**Background:** Janus kinase 2 (JAK2) is a key kinase in cellular signal transduction, and its abnormal activation is closely associated with various myeloproliferative neoplasms and inflammatory diseases. Developing selective JAK2 inhibitors is therefore an important direction in drug discovery, and accurate prediction of a compound's inhibitory activity (pIC50) against JAK2 can accelerate the discovery and optimization of lead compounds. **Objective:** This study uses public data from the ChEMBL database, combined with machine learning, to build a computational model that predicts the pIC50 values of JAK2 inhibitors efficiently and accurately. **Methods:** We collected compounds targeting human JAK2 (ChEMBL ID: CHEMBL2971) and their IC50 (nM) activity data from ChEMBL. After data cleaning (retaining only exact values with `standard_relation = '='`) and standardization (converting IC50 to pIC50 and averaging the pIC50 values of duplicate compounds), the final dataset contained 5546 compounds. RDKit (version 2022.9.5) was used to calculate Morgan fingerprints (radius=2, 2048 bits), MACCS keys fingerprints (167 bits), and 13 physicochemical and topological descriptors. The top 350 features were selected according to feature importance from a preliminary model evaluated during data processing; because five of these features were absent from the current dataset, the final model used **345** features. The dataset was randomly split 80:20 into a training set (n=4436) and a test set (n=1110). The prediction model was built with XGBoost (eXtreme Gradient Boosting, version 3.0.0), and its hyperparameters (`learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`) were optimized by grid search (GridSearchCV) with 5-fold cross-validation. Early stopping was used during final model training to prevent overfitting.
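The cleaning and standardization step in the Methods can be illustrated with a minimal sketch. It assumes IC50 values are given in nM, so that pIC50 = 9 − log10(IC50), and it averages pIC50 per compound as described. The function names, the record layout, and the compound identifiers are hypothetical and are not taken from the study's code.

```python
import math
from collections import defaultdict


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert IC50 in nM to pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    return 9.0 - math.log10(ic50_nm)


def mean_pic50_per_compound(records):
    """Keep exact measurements only and average pIC50 per compound.

    `records` is an iterable of (compound_id, standard_relation, ic50_nm)
    tuples; this layout is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for compound_id, relation, ic50_nm in records:
        if relation != "=":  # drop censored values such as '>' or '<'
            continue
        grouped[compound_id].append(ic50_nm_to_pic50(ic50_nm))
    return {cid: sum(vals) / len(vals) for cid, vals in grouped.items()}


# Hypothetical records, for illustration only.
records = [
    ("CPD-A", "=", 10.0),
    ("CPD-A", "=", 1000.0),
    ("CPD-B", ">", 10000.0),
    ("CPD-C", "=", 1.0),
]
result = mean_pic50_per_compound(records)
```

Note that the average is taken over pIC50 values, as stated in the Methods, which is not the same as converting an averaged IC50.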
**Results:** After hyperparameter optimization, the final XGBoost model showed good predictive performance on the held-out test set, with a coefficient of determination (R²) of 0.7184, a root mean square error (RMSE) of 0.5968 pIC50 units, and a mean absolute error (MAE) of 0.4593 pIC50 units. The training-set R² of 0.8978 indicated a good model fit, and the train-test gap in R² (about 0.18) was within an acceptable range, suggesting that overfitting was effectively controlled. **Conclusion:** This study constructed an XGBoost-based model for predicting the pIC50 of JAK2 inhibitors. Using easily computed molecular descriptors, the model achieved good prediction accuracy on a held-out test set drawn from the same ChEMBL data. It holds promise as an efficient virtual screening tool to support the early discovery and optimization of JAK2 inhibitors.
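For readers less familiar with the three reported metrics, the sketch below shows how R², RMSE, and MAE are conventionally computed from observed and predicted pIC50 values. It uses the standard definitions, not the study's own evaluation code, and the example values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)  # residual sum of squares

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical pIC50 values, for illustration only.
y_true = [6.0, 7.0, 8.0, 9.0]
y_pred = [6.5, 7.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred)
```

Because RMSE and MAE are expressed in pIC50 units, an MAE near 0.46 corresponds to a typical error of roughly a factor of three in IC50.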