Large-scale modelling of sparse kinase activity data

27 January 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Protein kinases are a protein family that plays an important role in several complex diseases, such as cancer and cardiovascular and immunological diseases. Kinases have conserved binding sites, so an inhibitor that targets them can show similar activity against different kinases. This can be exploited to create multi-target drugs. On the other hand, selectivity (lack of similar activities) is desirable to avoid toxicity issues. A vast amount of kinase activity data is available in the public domain and can be used in many different ways. Multi-task machine learning models are expected to excel on these kinds of datasets because they can learn from implicit correlations between tasks (in this case, activities against a variety of kinases). However, multi-task modelling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This dataset can be used for benchmarking and developing kinase activity prediction models. Overall, performance on the dissimilarity-driven cluster-based split is lower than on the random split for all models, indicating poor generalizability of the models. Nevertheless, we show that multi-task deep learning models, on this very sparse dataset, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multi-task) models on this benchmark set.
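
To make challenge (ii) concrete, the sketch below shows one common way to handle missing labels in a sparse compound-by-kinase matrix without imputing them: the loss is computed over observed entries only. This is an illustration under stated assumptions, not the method used in this work. The function name `masked_mse`, the use of `None` as the missing-value marker, and the toy activity values are all hypothetical.

```python
def masked_mse(y_true, y_pred):
    """Mean squared error over observed labels only.

    y_true: one row per compound, one column per kinase task;
            None marks a compound-kinase pair with no measurement
            (hypothetical convention for this sketch).
    y_pred: model predictions with the same shape, no missing values.
    """
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t is None:
                # Missing label: contributes nothing to the loss.
                continue
            total += (t - p) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed labels to score")
    return total / count


# Toy example: 2 compounds x 3 kinases, half of the labels missing.
y_true = [[6.0, None, 7.5],
          [None, 5.0, None]]
y_pred = [[6.5, 9.9, 7.5],
          [1.0, 5.0, 3.0]]

loss = masked_mse(y_true, y_pred)
print(loss)
```

Predictions for unmeasured pairs (9.9, 1.0 and 3.0 above) do not affect the result, so a multi-task model scored this way is driven only by measured activities.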

Keywords

Kinases
Benchmark set
QSAR
multi-task modelling
machine learning

Supplementary materials

SI
S1 Kinase200 - split analysis
S2 Kinase1000 - split analysis
S3 Kinase200 - default model performance
S4 Kinase1000 - default model performance
S5 pQSAR model validation
