Abstract
Protein kinases are a protein family that plays an important role in several complex
diseases, such as cancer and cardiovascular and immunological diseases. Kinases have conserved
binding sites, which when targeted can lead to similar activities of inhibitors against different
kinases. This can be exploited to create multi-target drugs. On the other hand, selectivity
(lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast
amount of kinase activity data in the public domain, which can be used in many different
ways. Multi-task machine learning models are expected to excel for these kinds of datasets
because they can learn from implicit correlations between tasks (in this case activities
against a variety of kinases). However, multi-task modelling of sparse data poses two
major challenges: (i) creating a balanced train-test split without data leakage and (ii)
handling missing data. In this work, we construct a kinase benchmark set composed of two
balanced splits without data leakage, using random and dissimilarity-driven cluster-based
mechanisms, respectively. This data set can be used for benchmarking and developing
kinase activity prediction models. Overall, performance on the dissimilarity-driven
cluster-based split is lower than on the random split for all models, indicating that the
models generalize poorly. Nevertheless, we show that multi-task deep learning models,
on this very sparse dataset, outperform single-task deep learning and tree-based models.
Finally, we demonstrate that data imputation does not improve the performance of (multi-task)
models on this benchmark set.
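
One common way to train a multi-task model on a sparse activity matrix without imputing values is to mask the loss, so that only measured compound-kinase pairs contribute. The sketch below is a minimal illustration under assumptions, not the method used in this work: the function name `masked_mse`, the use of NaN to mark missing labels, and the toy values are all hypothetical.

```python
import numpy as np

def masked_mse(y_true, y_pred):
    """Mean squared error over observed entries only.

    Assumption: a missing activity label is encoded as NaN in y_true.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    observed = ~np.isnan(y_true)          # True where a label was measured
    if not observed.any():
        return 0.0                        # no labels, so no loss contribution
    # Zero out the error at missing positions before squaring.
    diff = np.where(observed, y_pred - y_true, 0.0)
    return float((diff ** 2).sum() / observed.sum())

# Toy activity matrix (made-up values): 3 compounds x 2 kinases.
y_true = np.array([[6.0, np.nan],
                   [np.nan, 7.0],
                   [5.0, 8.0]])
y_pred = np.array([[6.5, 4.0],
                   [9.0, 7.0],
                   [5.0, 7.0]])

loss = masked_mse(y_true, y_pred)
print(loss)
```

Predictions at unmeasured positions (4.0 and 9.0 here) do not affect the loss, so the model is neither rewarded nor penalized for them.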
Supplementary materials
Title: SI
Description:
S1 Kinase200 - split analysis
S2 Kinase1000 - split analysis
S3 Kinase200 - default model performance
S4 Kinase1000 - default model performance
S5 pQSAR model validation