KiSSim: Predicting off-targets from structural similarities in the kinome

Protein kinases are among the most important drug targets because their dysregulation can cause cancer, inflammatory, and degenerative diseases. Developing selective inhibitors is challenging due to the highly conserved binding sites across the roughly 500 human kinases. Thus, detecting subtle similarities on a structural level can help to explain and predict off-targets among the kinase family. Here, we present the kinase-focused and subpocket-enhanced KiSSim fingerprint (Kinase Structural Similarity). The fingerprint builds on the KLIFS pocket definition, composed of 85 residues aligned across all available protein kinase structures, which enables residue-by-residue comparison without a computationally expensive alignment. The residues' physicochemical and spatial properties are encoded within their structural context including key subpockets at the hinge region, the DFG motif, and the front pocket.


Introduction
Protein kinases are involved in most aspects of cell life due to their role in signal transduction. Their dysregulation can cause severe diseases such as cancer, inflammation, and neurodegeneration, 1 which makes them a frequent target of drug discovery campaigns. In 2015, 30% of FDA-approved small molecules targeted kinases. 2 The roughly 500 kinases in the human genome share a highly conserved binding site, which makes it challenging to design drugs that are selective for a single kinase or a well-defined set of kinases (polypharmacology) while avoiding undesired off-targets. 3,4 Protein kinases bind adenosine triphosphate (ATP) to catalyze the transfer of its phosphate group to serine, threonine, or tyrosine residues of themselves or other proteins. ATP and most other ligands bind to the front cleft of the kinase pocket, which lies between the two kinase domains, the C- and N-terminal lobes. These domains are connected via a hinge region, which forms important hydrogen bonds to ATP as well as to most studied ligands.
The gate area contains the conserved DFG (aspartate-phenylalanine-glycine) motif, whose phenylalanine flips in and out of the front pocket, opening and closing a hydrophobic region in the back cleft, i.e., the DFG-in and DFG-out conformation, respectively. The back cleft also comprises the αC-helix with a conserved glutamate residue, which forms a salt bridge with a conserved lysine residue in the gate area. Such a conformation is called αC-in as opposed to αC-out. 5

Researchers have studied kinase similarity across the full kinome, or parts of it, from many different angles. Manning et al. 6 While sequence comparison (and thus evolutionary similarity) can explain many observations from kinase profiling experiments, other more distantly related off-targets remain undetected. For example, profiling Erlotinib against 48 kinases revealed high affinity against the on-target EGFR (TK group) but also against the non-TK off-targets SLK, LOK, and GAK; 8 similarly, the chemical probe SGC-STK17B-1 binds both DRAK2 and CaMKK2, 9 although they are dissimilar when judged solely by their sequence. 6 Focusing on the kinase pocket instead of the whole sequence already helps: the 50 most similar kinases to EGFR are exclusively TK kinases when ranked by full-length sequence, whereas the ranking includes non-TK kinases when only the pocket sequence is considered. 10 The KinCore phylogenetic tree produced by a kinome-wide structure-guided MSA 7,11 overall confirms the assignment from Manning et al. 6 but provides higher precision, e.g., regarding previously unassigned kinases. Schmidt et al. 12 have recently investigated the similarities between a panel of nine kinases (EGFR, ErbB2, PIK3CA, KDR, BRAF, CDK2, LCK, MET, and p38a) based on different pocket encodings, including the pocket sequence identity, pocket structure similarity, interaction fingerprint similarity, and ligand promiscuity.
Individual kinase relationships differed according to these different perspectives, while some trends could be observed such as the atypical kinase PIK3CA being an outlier amongst the otherwise typical kinases in this panel.
In an attempt to facilitate computer-aided kinase similarity studies, we here aim to add another perspective. Binding site comparison methods employed so far can be applied to any binding site regardless of the protein class. Kuhn et al. 13 have applied such a method, Cavbase, to the structurally resolved kinome and could detect expected and unexpected kinase relationships. Since kinases are highly conserved and have been aligned and annotated across the full structurally covered kinome, a binding site comparison method tailored to kinases may provide an extended perspective on kinase similarities. We make use of data in the KLIFS 14 database, a rich resource for kinase research that extracts protein kinase-focused information on structures from the PDB, 15 on inhibitors in clinical trials from the PKIDB, 16 on bioactivities from ChEMBL, 17 and much more. All kinase structures from the PDB are split into single chains and models and aligned with respect to sequence and structure across the full structurally covered kinome. The KLIFS authors defined the kinase pocket as a set of 85 residues that interact with co-crystallized ligands in the initial KLIFS dataset of more than 1200 structures. 5 Thanks to this structural alignment, it is possible to look up all 85 residues in any kinase structure, provided that the residue is structurally resolved and not in a gap position. This pocket alignment is the basis for the KiSSim fingerprint introduced here.
The kinase-focused and subpocket-enhanced KiSSim (Kinase Structural Similarity) fingerprint builds on the KLIFS 14 pocket, whose alignment allows a computationally inexpensive residue-by-residue comparison. The residues' physicochemical and spatial properties are encoded within their structural context including important kinase subpockets (the hinge region, DFG region, and front pocket), building on features from previously published methods such as SiteAlign, 18 KinFragLib, 19 and Ultrafast Shape Recognition (USR). 20 We used the fingerprint to calculate all-against-all similarities within the structurally covered kinome and to generate a KiSSim-based kinome tree. Detected similarities can be used to predict off-targets or guide polypharmacology studies and to rationalize profiling observations on a structural level. We distribute the method as an open-source Python package at https://github.com/volkamerlab/kissim and as a conda package, alongside the data and analysis notebooks at https://github.com/volkamerlab/kissim_app to support FAIR 21 science.

Methods & Data
In the following, we outline the KiSSim methodology and implementation, the datasets used, and the method's evaluation. All data, fingerprints, and analyses are available at https://github.com/volkamerlab/kissim_app.

KiSSim methodology
The KiSSim methodology consists of three steps: the encoding of a set of kinase binding sites as KiSSim fingerprints (Figure 1), the all-against-all comparison of these structures using their fingerprints, and, since one kinase can be represented by multiple structures, the mapping of multiple structure/fingerprint pairs to one kinase pair.

Encoding: From structure to fingerprint
The KiSSim fingerprint encodes the 85 KLIFS pocket residues in the form of physicochemical and spatial properties as illustrated in Figure 1. We summarize the encoding procedure in the following; for a detailed description please refer to the Supplementary methods section.

Figure 1: KiSSim fingerprint encodes physicochemical and spatial properties of kinase pockets. The fingerprint builds on the KLIFS 14 pocket definition, i.e., 85 residues aligned across all available protein kinase structures, which enables residue-by-residue comparison without a computationally expensive alignment. Each residue is encoded physicochemically and spatially. Physicochemical properties include the following features per residue (example: phenylalanine/PHE): (a) Pharmacophoric features and size categories are taken from the SiteAlign 18 binding site comparison methodology. (b) Side chain orientation is adapted from SiteAlign and defined as inward-facing, intermediate, or outward-facing depending on the vertex angle between the pocket centroid, the residue's side chain representative (Table S3), and CA atom. (c) Solvent exposure is defined as high, intermediate, or low, depending on the ratio of CA atoms in the upper half of a sphere cut in half by a plane normal to the residue's CA-CB vector. The implementation is based on BioPython's HSExposure. 22,23 Spatial properties are defined as follows: (d) Each residue's distance to the pocket center and important kinase subpockets, i.e., the hinge region, DFG region, and the front pocket. On the right, example locations are shown in the 3D representation of kinase EGFR (PDB ID: 2ITO, KLIFS structure ID: 783). (e) The distance distributions per pocket center and subpocket are furthermore described by their first three moments, i.e., the mean, standard deviation, and skewness.

Pharmacophoric and size features are taken from the SiteAlign categories for standard amino acids. 18 They encode the size based on the number of heavy atoms, the number of hydrogen bond donors (HBD) and hydrogen bond acceptors (HBA), the charge (negative, neutral, or positive), and aromatic and aliphatic properties (present or not present) of a residue (Table S1). The side chain orientation (inward-facing, intermediate, or outward-facing) is based on the vertex angle from the residue's CA atom (vertex) to the pocket center and to the residue's outermost side chain atom, the side chain representative (Table S3). The solvent exposure of a residue (high, intermediate, or low) is based on the ratio of CA atoms in the upper half of a sphere that is placed around the residue's CA atom (radius 12Å) and cut in half by a plane normal to the residue's CA-CB vector, as implemented in BioPython's HSExposure module. 22,23

Spatial properties are described by continuous values, i.e., distances and moments. Spatial distances are calculated from each residue's CA atom to the pocket's geometric center and to prominent subpocket centers. The pocket center is the centroid of all pocket CA atoms. The selected subpocket centers include functionally well-characterized kinase regions such as the hinge region, DFG region, and front pocket. Each subpocket center is calculated based on the centroid of three anchor residues' CA atoms (Table S4), following the idea described in the KinFragLib methodology. 19 We added the code to calculate the subpocket centers to the structural cheminformatics library OpenCADD (module opencadd.structure.pocket) 24 to allow for easy access in other projects. Spatial moments describe each of the four distributions of distances to the pocket center, hinge region, DFG region, and front pocket. In KiSSim, the first three moments are used: the mean, the standard deviation, and the cube root of the skewness. This procedure is inspired by and adapted from the ligand-based Ultrafast Shape Recognition (USR) 20 method.
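The spatial encoding described above can be illustrated with a minimal NumPy sketch. It is not the kissim implementation: the coordinates are random toy data, the anchor residue indices are hypothetical placeholders (the actual anchors are listed in Table S4), missing residues are not handled, and the third moment is taken here as the cube root of the third central moment, following USR, which is an assumption about how "skewness" is computed.

```python
import numpy as np


def subpocket_center(ca_coords, anchor_indices):
    """Centroid of the CA atoms of the given anchor residues."""
    return ca_coords[list(anchor_indices)].mean(axis=0)


def distances_to_center(ca_coords, center):
    """Distance from every pocket residue's CA atom to one center."""
    return np.linalg.norm(ca_coords - center, axis=1)


def first_three_moments(distances):
    """Mean, standard deviation, and cube root of the third central moment."""
    mean = distances.mean()
    std = distances.std()
    third_central = ((distances - mean) ** 3).mean()
    return np.array([mean, std, np.cbrt(third_central)])


# Toy pocket: 85 CA positions (random numbers, not a real structure).
rng = np.random.default_rng(0)
ca_coords = rng.normal(scale=8.0, size=(85, 3))

# Hypothetical anchor residue indices (0-based); placeholders only.
centers = {
    "pocket_center": ca_coords.mean(axis=0),
    "hinge_region": subpocket_center(ca_coords, (15, 45, 47)),
    "dfg_region": subpocket_center(ca_coords, (19, 23, 80)),
    "front_pocket": subpocket_center(ca_coords, (9, 47, 71)),
}

distances = {name: distances_to_center(ca_coords, c) for name, c in centers.items()}
moments = {name: first_three_moments(d) for name, d in distances.items()}

spatial_distances = np.stack(list(distances.values()))  # shape (4, 85)
spatial_moments = np.stack(list(moments.values()))      # shape (4, 3)
```

Each of the four centers thus contributes one distance per residue and three summary values per distance distribution.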
Fingerprint length. The final full-length fingerprint encompasses eight discrete physicochemical features (8 features × 85 residues = 680 bits), four continuous spatial distance features (4 features × 85 residues = 340 bits), and three continuous spatial moment features (3 moments × 4 distributions = 12 bits), resulting in a 1032-bit vector. Optionally, a subset of residues can be selected to generate a subset fingerprint emphasizing certain residues. We offer a subset of residues that is based on frequently interacting co-crystallized ligands, 25

Pairwise structure comparison
Two kinase pocket structures, encoded as two fingerprints, can be compared in two steps (Figure 2). First, we calculate for each feature the distance between the corresponding two feature vectors across the 85 residue entries, producing a feature distances vector of length 15 (i.e., aggregating over the columns in Figure 2 a). For example, the two fingerprints' 85-bit size feature vectors, representing the size of each of the 85 pocket residues, will be reduced to a single size feature distance. The distance between discrete features is defined as the scaled L1 norm (scaled Manhattan distance), whereas the distance between continuous features is defined as the scaled L2 norm (scaled Euclidean distance), 27

$$\|x\|_1 = \frac{1}{n} \sum_{i=1}^{n} |x_i| \quad \text{and} \quad \|x\|_2 = \frac{1}{n} \sqrt{\sum_{i=1}^{n} x_i^2}, \tag{1}$$

where x is a vector of length n (here, the element-wise difference between the two feature vectors).
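As a minimal sketch of this first comparison step, the following NumPy code reduces two toy fingerprints to a feature distances vector of length 15. The dictionary layout of the fingerprint and the function names are illustrative assumptions rather than the kissim API, and the 1/n scaling of the L2 norm is assumed to mirror the scaled L1 norm.

```python
import numpy as np


def scaled_l1(x):
    """Scaled Manhattan norm: (1/n) * sum(|x_i|)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x).sum() / x.size


def scaled_l2(x):
    """Scaled Euclidean norm: (1/n) * sqrt(sum(x_i^2)); scaling is an assumption."""
    x = np.asarray(x, dtype=float)
    return np.sqrt((x ** 2).sum()) / x.size


def feature_distances(fp1, fp2):
    """Reduce two fingerprints to 8 + 4 + 3 = 15 per-feature distances."""
    # Discrete physicochemical features: one row of 85 values per feature.
    physicochemical = [
        scaled_l1(a - b) for a, b in zip(fp1["physicochemical"], fp2["physicochemical"])
    ]
    # Continuous distance features: one row of 85 values per center.
    distances = [scaled_l2(a - b) for a, b in zip(fp1["distances"], fp2["distances"])]
    # Continuous moment features: one row of 4 values (one per center) per moment.
    moments = [scaled_l2(a - b) for a, b in zip(fp1["moments"], fp2["moments"])]
    return np.array(physicochemical + distances + moments)


def toy_fingerprint(seed):
    """Random stand-in for a fingerprint (hypothetical layout, no real data)."""
    rng = np.random.default_rng(seed)
    return {
        "physicochemical": rng.integers(0, 3, size=(8, 85)).astype(float),
        "distances": rng.uniform(3.0, 25.0, size=(4, 85)),
        "moments": rng.uniform(1.0, 15.0, size=(3, 4)),
    }


fp_a, fp_b = toy_fingerprint(1), toy_fingerprint(2)
d_ab = feature_distances(fp_a, fp_b)  # vector of length 15
```

Comparing a fingerprint with itself returns a vector of zeros, as expected for a distance.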

Kinome-wide comparison
The kinome-wide comparison is based on an all-against-all comparison of all available structures. Note that a kinase can be represented by multiple structures (see KLIFS data section), thus, a kinase pair can be represented by multiple structure pairs with multiple distance values. Our final goal is to assign one distance value to each kinase pair as a measure of the similarity between these two kinases (Figure 2 b). The structural coverage of kinases is highly imbalanced: Some kinases are represented by one structure only, others like EGFR or CDK2 by more than 100. We select the structure pair with the lowest distance as representative for the kinase pair, hence always picking the two closest structures in the dataset. For example, if a dataset consists of ten structures representing three kinases, the 10 × 10 all-against-all structure distance matrix will be reduced to a 3 × 3 all-against-all kinase distance matrix, consisting of the lowest distance values only after mapping structure pairs to kinase pairs.
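The mapping from structure pairs to kinase pairs can be sketched as a minimum over blocks of the structure distance matrix. The code below mirrors the ten-structure, three-kinase example with random toy distances; the function name and the kinase assignment of the structures are illustrative assumptions, and the diagonal is simply set to zero.

```python
import numpy as np


def kinase_distance_matrix(structure_distances, structure_kinases):
    """Keep, for each kinase pair, the lowest distance over all structure pairs."""
    kinases = sorted(set(structure_kinases))
    members = {
        k: [i for i, name in enumerate(structure_kinases) if name == k] for k in kinases
    }
    result = np.zeros((len(kinases), len(kinases)))
    for a, kinase_a in enumerate(kinases):
        for b, kinase_b in enumerate(kinases):
            if a == b:
                continue  # distance of a kinase to itself is left at zero
            block = structure_distances[np.ix_(members[kinase_a], members[kinase_b])]
            result[a, b] = block.min()
    return kinases, result


# Toy data: ten structures representing three kinases.
structure_kinases = ["EGFR"] * 4 + ["CDK2"] * 4 + ["GAK"] * 2
rng = np.random.default_rng(42)
upper = np.triu(rng.uniform(0.01, 0.3, size=(10, 10)), k=1)
structure_distances = upper + upper.T  # symmetric 10 x 10 matrix, zero diagonal

kinases, kinase_matrix = kinase_distance_matrix(structure_distances, structure_kinases)
```

Taking the minimum makes the result independent of how many structures exist per kinase, at the cost of being sensitive to a single unusually close structure pair.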

Fingerprint and similarity visualization in 3D
Fingerprint features can be visualized in 3D using the NGLviewer 28,29 and IPyWidgets 30 for the following applications: (a) Fingerprint features of a structure can be visualized in 3D by coloring the residues by different feature values. (b) The difference between two structures can be highlighted to spot positions of high or low similarity between two structures.
The differences are shown for each feature type individually. (c) The standard deviation of spatial features between all structures available for one kinase can be mapped onto an example structure in 3D to show regions of high or low variability between different kinase conformations.

KiSSim tree
The kinase distance matrix produced as described in the Kinome-wide comparison section is submitted to hierarchical clustering as implemented in SciPy, 31 using the Euclidean distance as metric and Ward's criterion as linkage. We generate a phylogenetic tree in the Newick format based on this KiSSim kinase clustering. The tree branches are labeled with the mean of all distances belonging to that branch; the tree leaves are annotated with the kinase names and their assigned Manning kinase groups. We visualize the tree in an automated way using BioPython's Phylo 22,32 module for use in Jupyter Notebooks, and in a manual way using the freely available FigTree 33 software to produce publication-ready circular trees.
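A minimal sketch of the clustering and Newick export with SciPy is shown below. The distance values are made-up toy numbers, the `to_newick` helper is hypothetical, and branch labels and Manning group annotations are omitted. The sketch assumes that the rows of the kinase distance matrix are passed to SciPy as observation vectors, which is one reading of "Euclidean metric with Ward linkage".

```python
import numpy as np
from scipy.cluster import hierarchy


def to_newick(node, names):
    """Recursively convert a SciPy ClusterNode into a Newick string (no branch labels)."""
    if node.is_leaf():
        return names[node.id]
    left = to_newick(node.get_left(), names)
    right = to_newick(node.get_right(), names)
    return f"({left},{right})"


# Toy kinase distance matrix (illustrative values, not KiSSim results).
kinase_names = ["EGFR", "ErbB2", "CDK2", "p38a"]
kinase_distances = np.array(
    [
        [0.00, 0.05, 0.30, 0.32],
        [0.05, 0.00, 0.31, 0.30],
        [0.30, 0.31, 0.00, 0.08],
        [0.32, 0.30, 0.08, 0.00],
    ]
)

# Rows are treated as observation vectors; SciPy may emit a ClusterWarning
# because the input looks like an uncondensed distance matrix.
linkage_matrix = hierarchy.linkage(kinase_distances, method="ward", metric="euclidean")
root = hierarchy.to_tree(linkage_matrix)
newick = to_newick(root, kinase_names) + ";"

# Cutting the tree into two clusters groups the two toy kinase pairs.
labels = hierarchy.fcluster(linkage_matrix, t=2, criterion="maxclust")
```

The resulting string can be read by tree viewers that accept the Newick format.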

KiSSim implementation
The kissim library is implemented as an open-source Python package, which is available on GitHub at https://github.com/volkamerlab/kissim and as a conda package at conda-forge. 34,35 Structures are retrieved via the OpenCADD-KLIFS module 24 and are encoded as fingerprints using the FingerprintGenerator class; fingerprints can be compared using the FingerprintDistanceGenerator class. We also offer quick-access encode and compare functionality as a Python API and as a command-line interface (CLI); see Figure 3. Lastly, the kissim.encoding.tree module offers an automated all-against-all clustering and phylogenetic tree generation, while the 3D visualization of fingerprints and pairwise comparisons is implemented in the kissim.viewer module.
Structural data is read and processed with BioPython 22 and BioPandas; 36 computation is performed with NumPy, 37 Pandas, 38 SciPy, 39 and Scikit-learn. 40 The code for operations that are of use outside of the KiSSim project has been added to the OpenCADD library: 24 KLIFS queries are implemented in the OpenCADD-KLIFS module and subpocket centers can be defined and visualized with the OpenCADD-pocket module.
All code is written in Python 3 41

Figure 3: The kissim library's Python API and CLI. Structures from the KLIFS database can be encoded as fingerprints using the FingerprintGenerator class (details in Figure 1) and compared using the FeatureDistancesGenerator and FingerprintDistanceGenerator classes (details in Figure 2). The package offers the wrappers encode and compare for quick and easy access from within a Python script (Python API) or from the command line (CLI). Please also refer to the kissim library's documentation at https://kissim.readthedocs.io.

Data
We use the following sources of external data: KLIFS kinase structures 14 and the profiling datasets by Karaman et al. 8 and Davis et al., 53 filtered and processed as described in the following. All prepared datasets described here are accessible via the src.data module at https://github.com/volkamerlab/kissim_app.

KLIFS data
We downloaded the human structural kinase dataset from the KLIFS database version 3.2 14 on 2021-09-02. This dataset contained 11806 human monomeric structures, i.e., PDB entries split into monomeric structures if consisting of multiple chains and alternate models.
We filtered the dataset for human kinases with a resolution ≤ 3Å and a KLIFS quality score ≥ 6. The KLIFS quality score ranges from 0 (bad) to 10 (flawless) and describes the quality of the structural alignment and resolution regarding missing residues and atoms. In addition, we excluded structures with more than three pocket mutations or with more than eight missing pocket residues. In order to reduce computational costs, we selected the best structure per kinase in each PDB entry (kinase-PDB pair); the best structure per kinase-PDB pair is defined as the structure with the fewest missing pocket residues, the fewest missing pocket atoms, the lowest alternate model identifier, and the lowest chain identifier (in that order). Structures were also excluded if they are flagged as problematic in KLIFS or if they could not be encoded as a KiSSim fingerprint. We produced three final datasets of structures for KiSSim fingerprint generation and all-against-all comparison: structures in any DFG conformation, DFG-in conformation only, and DFG-out conformation only. Table 1 lists the number of structures remaining after each filtering step.
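The filtering and best-structure selection can be sketched in plain Python as follows. The record fields, identifiers, and values are hypothetical placeholders rather than the KLIFS schema or real entries, and the checks for structures flagged as problematic or failing the encoding are left out.

```python
# Hypothetical structure records (field names and values are placeholders).
structures = [
    {"id": 1, "kinase": "EGFR", "pdb_id": "pdb_a", "chain": "A", "alt_model": "-",
     "species": "Human", "resolution": 2.0, "quality_score": 8.0,
     "n_mutations": 0, "n_missing_residues": 1, "n_missing_atoms": 0},
    {"id": 2, "kinase": "EGFR", "pdb_id": "pdb_a", "chain": "B", "alt_model": "-",
     "species": "Human", "resolution": 2.0, "quality_score": 9.0,
     "n_mutations": 0, "n_missing_residues": 0, "n_missing_atoms": 4},
    {"id": 3, "kinase": "EGFR", "pdb_id": "pdb_b", "chain": "A", "alt_model": "-",
     "species": "Human", "resolution": 3.4, "quality_score": 8.0,
     "n_mutations": 0, "n_missing_residues": 0, "n_missing_atoms": 0},
    {"id": 4, "kinase": "CDK2", "pdb_id": "pdb_c", "chain": "A", "alt_model": "B",
     "species": "Human", "resolution": 1.8, "quality_score": 9.0,
     "n_mutations": 0, "n_missing_residues": 0, "n_missing_atoms": 0},
    {"id": 5, "kinase": "CDK2", "pdb_id": "pdb_c", "chain": "A", "alt_model": "A",
     "species": "Human", "resolution": 1.8, "quality_score": 9.0,
     "n_mutations": 0, "n_missing_residues": 0, "n_missing_atoms": 0},
    {"id": 6, "kinase": "GAK", "pdb_id": "pdb_d", "chain": "A", "alt_model": "-",
     "species": "Human", "resolution": 2.1, "quality_score": 5.0,
     "n_mutations": 0, "n_missing_residues": 2, "n_missing_atoms": 0},
]


def passes_filters(s):
    """Species, resolution, quality score, mutation, and missing-residue cutoffs."""
    return (
        s["species"] == "Human"
        and s["resolution"] <= 3.0
        and s["quality_score"] >= 6.0
        and s["n_mutations"] <= 3
        and s["n_missing_residues"] <= 8
    )


def best_per_kinase_pdb(records):
    """Keep one structure per kinase-PDB pair, ranked by the stated criteria in order."""
    best = {}
    for s in records:
        key = (s["kinase"], s["pdb_id"])
        rank = (s["n_missing_residues"], s["n_missing_atoms"], s["alt_model"], s["chain"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, s)
    return [s for _, s in best.values()]


filtered = [s for s in structures if passes_filters(s)]
selected = best_per_kinase_pdb(filtered)
selected_ids = sorted(s["id"] for s in selected)
```

Comparing tuples lexicographically applies the four ranking criteria in the stated order without any nested conditionals.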

Bioactivity profiling data
To compare predicted and measured on- and off-targets, we use two kinase bioactivity datasets available through KinMap: 56 The Karaman et al. 8 and 442 kinases, respectively. The lower the $K_d$ value, the higher the binding affinity, which is used as a proxy for activity. We pooled data from both datasets by taking the union of all kinase-ligand pairs. If kinase-ligand pairs have bioactivity values in both datasets,

Evaluation
prints (IFPs), and SiteAlign. 18 All prepared datasets and evaluation strategies described here are accessible via the src.data and src.evaluation modules at https://github.com/volkamerlab/kissim_app.

KiSSim evaluation using profiling data
To evaluate how well KiSSim detects kinase similarities, we need to define a ground truth of kinase similarities. We use profiling data as a surrogate for this, since it is safe to assume that kinases that are targeted by the same ligand share similar binding sites.
To this end, we use the profiling Karaman-Davis dataset, which describes the activity of ligands against a panel of kinases. We assign each ligand $l_i$ in the profiling dataset to its reported key target(s) $k_j(l_i)$ in the PKIDB, 16 ranging from one target to multiple targets, e.g., Erlotinib is assigned to EGFR only while Imatinib binds to ABL1, KIT, RET, TRKA,
2. We rank all kinases by their KiSSim distance to EGFR. These are our KiSSim-based kinase similarities.
3. We calculate ROC curves to demonstrate how well the profiling data is predicted by our KiSSim-based kinase similarities.
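The ranking and ROC steps can be illustrated with a small, dependency-free sketch. All distances and activity labels below are made-up toy values for a hypothetical query kinase, the function names are illustrative, and the sketch does not treat tied distances specially. How a measured bioactivity is turned into an active/inactive label (for example, by a cutoff) is an assumption left outside this example.

```python
def roc_points(distances, is_active):
    """ROC curve obtained by walking through kinases from lowest to highest distance."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    n_pos = sum(is_active)
    n_neg = len(is_active) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_active[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))  # (false positive rate, true positive rate)
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Toy input: distances of five kinases to the query kinase (query itself excluded)
# and whether the ligand was measured as active against each of them.
toy_distances = [0.02, 0.05, 0.10, 0.20, 0.30]
toy_active = [True, True, False, True, False]

toy_points = roc_points(toy_distances, toy_active)
toy_auc = auc(toy_points)  # 5/6: five of six active/inactive pairs are ranked correctly
```

An AUC of 1 would mean that every active kinase is closer to the query than every inactive one, while 0.5 corresponds to a random ranking.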
Some kinase activities measured in the profiling dataset are rather unexpected from a sequence-based similarity point of view. For the EGFR-Erlotinib example, we use the KinMap server to plot the profiling-based and KiSSim-based ranked kinases onto the kinome tree by Manning et al. 6 For example, we highlight kinases with measured activities against Erlotinib as well as the 50 most similar kinases to EGFR as detected by KiSSim. All kinases that are part of the KiSSim dataset are shown as well to define which data points are available for similarity predictions.

KiSSim comparison to other methods
We outline here the preparation of all-against-all kinase distance matrices based on different similarity measures to be compared to the KiSSim kinase distance matrix (KiSSim Jaccard distance is used to compare the IFPs. If multiple IFP pairs describe the same kinase pair, we selected the minimum distance as the representative measure for the kinase pair, following the same procedure as described for the KiSSim methodology.

SiteAlign. We performed an all-against-all comparison using the pocket comparison method SiteAlign 18 (version 4.0). In this approach, properties of a binding site are projected onto a triangulated sphere positioned at the pocket center and stored as a fingerprint, which is then iteratively aligned and compared to another binding site fingerprint. Since we used the existing KLIFS alignment, a few SiteAlign parameters were adapted to reduce runtime: we decreased the number of alignment steps in SiteAlign from 3 to 1 and the translational steps from 5 to 3, and reduced the rotational and translational intensity from 2π to π/4 and from 4 to 1, respectively. A comparison of the SiteAlign performance for > 4000 structure pairs with the default and adjusted settings showed that the adjusted settings resulted in lower distances (average decrease of 6%) while matching a higher number of triangles (average increase of 15%). Pocket residues with modifications (e.g., phosphorylated threonines) were excluded to avoid segmentation faults.

Results and Discussion
We present here the generated KiSSim dataset and the resulting KiSSim-based kinome tree. Furthermore, we evaluate the KiSSim results in comparison to profiling data (KiSSim evaluation using profiling data section) and other pocket encoding methods (KiSSim comparison to other methods section).

KiSSim dataset
KLIFS structures are filtered as described in detail in the KLIFS data section (Table 1), then encoded and compared as described in the KiSSim methodology section. When considering structures in DFG-in conformations only, 4112 fingerprints representing 257 kinases result in a 4112 × 4112 structure distance matrix and, after mapping structure to kinase pairs as described in the Kinome-wide comparison section, in a 257 × 257 kinase distance matrix (Table 1).

Fingerprint feature value distribution
The KiSSim fingerprint encodes the 85 KLIFS pocket residues in the form of physicochemical and spatial properties. Physicochemical properties include pharmacophoric and size features, side chain orientation, and solvent exposure; spatial properties include each residue's distance to the pocket center as well as to three subpockets and the first three moments of the resulting distance distributions (Figure 1). We investigate here the fingerprint feature value distribution across all KiSSim fingerprints.  Figure S3). Distances from subpocket centers to regions such as the G-rich loop (residues 4-9), the αC-helix (residues 20-30), and the DFG motif vary more than for example to the hinge region, which agrees with knowledge on more flexible vs. more stable regions in the kinase pocket. The spatial moment features describe the distance distributions between the pocket residues to the subpocket centers. They show lower variability for the mean and the standard deviation but high variability for the skewness (Figure 4a, right).
The spatial features are based on the KiSSim subpockets as described in the Encoding: From structure to fingerprint section. Although these subpockets are calculated for each structure individually, they are robust across the structural kinome. The subpocket centers occupy the same space across the aligned KLIFS structures, while the front pocket and DFG region centers show higher variability than the hinge region and pocket center (Figure 4b), as expected. Therefore, the subpocket definition procedure seems to be robust enough to span comparable subpocket centers while fine-grained enough to encode structural differences.
In conclusion, the feature space encoded in the KiSSim fingerprint, on the one hand, reflects sequence-related similarities between kinases on a generalized level through the defined physicochemical properties and, on the other hand, incorporates information on flexible and stable regions through the defined spatial properties.

Fingerprint distances to compare structures
Moving on from the structure encoding (fingerprints) to the structure comparison (fingerprint distances), we aimed to explore if the KiSSim fingerprint can be used to discriminate between kinases and between DFG-in and DFG-out conformations.
First, we measured the discriminating power between kinases by comparing KiSSim fingerprint distances between DFG-in structures of the same kinase and of different kinases, i.e. intra-kinase and inter-kinase distances, respectively. With a median of 0.02 compared to 0.11, the (about 200000) intra-kinase distances are significantly lower than the (about 8.2 million) inter-kinase distances as shown in Figure 5a, indicating that the fingerprint can discriminate between kinases. Note that the distances between structure pairs describing the

KiSSim-based kinome tree
Structure is known to be more conserved than sequence, 64 and previous studies have shown that including structural information adds orthogonal information to shed light on unexpected similarities between kinases and off-target effects. 7,12 To help detect such relationships between more distantly related kinases, we generated KiSSim kinome trees based on the DFG-in conformations, as described in detail in the KiSSim tree section, to investigate all-against-all relationships between kinases compared to the sequence-based kinome tree by Manning et al. 6 Note that we can base the comparison on structurally resolved

Kinases from the STE group are assigned mostly to a single cluster that is, however, shared with kinases from many other kinase groups. The STE kinases MAP2K [1,4,6,7] and OSR1 are separated from the other STE kinases.
Kinases from the CMGC group are clustered in two subgroups: kinases from the CDK, CDKL, and MAPK families build one cluster, while kinases from the DYRK, SRPK, and CLK families build another. The CK2a2 kinase (CK2 family) is an outlier.
Kinases from the TKL group are mainly clustered together with kinases from the Other group but some are separated from the rest (DLK, BRAF, IRAK2, and LIMK1). Kinases from the CK1 group build one group except for TTBK1 and TTBK2. Kinases from the AGC group cluster together as well; MSK1 is the only outlier and is found closer to the CAMK kinases. Lastly, only three atypical kinases are included in the KiSSim dataset (ADCK3, RIOK1, and RIOK2); they build their own cluster, neighboring the CK1 kinases.
Overall, the KiSSim tree largely recovers the sequence-based kinome tree by Manning et al., 6 including subbranches as discussed for the kinases assigned to the TK and CMGC groups.
This is not surprising because we do encode the sequence in an abstracted manner in the physicochemical KiSSim fingerprint bits. However, some kinases show deviating relationships, of which some can be rationalized, such as the CaMKK2 and DRAK2 relationship that is also seen in profiling data. Thus, the addition of structural information in the KiSSim fingerprint allows us to cluster more distantly related kinases. This aspect of the KiSSim tree is of interest because it suggests kinase similarities that are not apparent from sequence alone.

KiSSim evaluation using profiling data
As discussed, the KiSSim tree shows expected and unexpected kinase (dis)similarities.
In order to evaluate the specificity and sensitivity of our method, we use profiling data as a surrogate for (real) expected kinase (dis)similarities: if a ligand targets a set of kinases with high activity, these kinases have similar binding sites and are therefore treated as similar kinases.
To this end, we pooled the Karaman et al. 8 and Davis et al. 53 datasets and filtered for FDA-approved inhibitors and their targets as listed in the PKIDB. 16 The dataset preparation is described in detail in the Bioactivity profiling data section. We show the KiSSim method's performance in the form of ROC curves for each inhibitor's listed targets.
For example, Imatinib has three reported on-targets (assigned in PKIDB) and two off-targets (based on activity data in the Karaman-Davis dataset); KiSSim's performance is

Comparison of KiSSim to other methods
In the next step, we investigated all-against-all comparisons based on the KiSSim fingerprints, the KLIFS pocket sequence, KLIFS ligand-pocket interaction fingerprints (IFP), and the SiteAlign scores. The data preparation steps are described in detail in the KiSSim comparison to other methods section.
(a) Highlight residues with at least one large difference in their physicochemical bits (∆d normalized = 0.6, blue), spatial bits (∆d normalized = 0.2, yellow), or both (green). Color residues by their differences in their (b) HBA, (c) aliphatic, and (d) hinge region feature, ranging from no difference (white) to highest difference (blue). See notebook for more details. 72

The KiSSim fingerprint contains physicochemical bits, which generalize the pocket sequence, and spatial bits, which consider the individual atom/residue positions in the underlying kinase conformations. First, we use the KLIFS pocket sequence (KLIFS seq) to probe if the KiSSim fingerprint's generalized sequence and spatial information improve predictions compared to sequence information only. Second, we use the KLIFS pocket IFP (KLIFS IFP) to probe if the KiSSim fingerprint, which does not contain any information about interactions, improves kinase similarity predictions compared to interaction-based fingerprints.
The advantage of IFPs is that they emphasize important residues and interactions as seen based on one or more ligands; the disadvantage is that not all possibly relevant interactions have been observed yet. Note that combining the IFP information with KiSSim, i.e., using only interacting residues in the KiSSim fingerprint, can improve the KiSSim performance as discussed in the KiSSim evaluation using profiling data section. Third, we use kinase similarities calculated with the SiteAlign methodology (SiteAlign), from which we adapted some of the physicochemical KiSSim features, to confirm that the KiSSim fingerprint adds relevant kinase-focused information.
Correlation. We compared the pairwise kinase distances between the four different method setups (Figure S9). We observed a rather strong correlation between the KiSSim

Performance. We performed the same profiling analysis, which we discussed for KiSSim (mean AUC 0.75 ± 0.12) in the KiSSim evaluation using profiling data section, for the KLIFS seq (mean AUC 0.78 ± 0.15), KLIFS IFP (mean AUC 0.63 ± 0.12), and SiteAlign (mean AUC 0.71 ± 0.12) datasets; see Figure 7.
The KiSSim approach performs slightly worse than the KLIFS pocket sequence comparison for ligands like Imatinib, whose reported on-targets all belong to the TK group, but shows better performance for Erlotinib, Bosutinib, and Doramapimod, which have known kinase targets belonging to different kinase groups. Hence, while the sequence-based approach picks up kinase group assignments as expected, KiSSim picks up more distant and less obvious off-targets.
The KLIFS pocket IFP comparison performs similarly to the KiSSim comparison in the case of Erlotinib but worse for the other three ligands. In contrast to the KiSSim approach, pocket similarities can only be detected by the IFP approach if the respective kinases have been co-crystallized with ligands that form similar interaction patterns. Such an IFP-based comparison is probably more successful for a defined kinase set with high coverage of co-crystallized ligands than for a kinome-wide comparison as performed here.
The SiteAlign methodology projects topological and chemical properties onto a sphere that sits in the center of a protein pocket. The spheres are aligned based on these projections, and a similarity score is calculated between the aligned fingerprints. Finding the right alignment is time-consuming; hence, we provided SiteAlign with the KLIFS-aligned structures as a starting point and reduced the number of iterations, as described in the KiSSim comparison to other methods section. KiSSim outperforms SiteAlign in most cases, although often only by a small margin. Taken together, these findings show that the KiSSim methodology compares well with established methods while often improving predictions for kinase pairs without an obvious sequence-based relationship. The pocket sequence and IFP-based methods are much faster than the structure-based methods KiSSim and SiteAlign; however, the overall kinase similarity assessment benefits from the added structural pocket information. KiSSim's setup and runtime are more convenient than those of SiteAlign, although KiSSim does rely on the KLIFS 85-residue pocket alignment.

Conclusion
We presented here the KiSSim (Kinase Structural Similarity) fingerprint as a novel structure-enabled pocket encoding tailored to kinase pockets. The fingerprint encodes physicochemical and spatial properties of the 85 KLIFS residues, which are aligned across the structurally covered kinome. On the one hand, the majority of physicochemical bits (size, HBD, HBA, charge, aromatic, and aliphatic, which are adapted from the SiteAlign method) encode the pocket sequence in a generalized, pharmacophoric way. On the other hand, the side chain orientation, solvent exposure, and the spatial bits (the distances to the pocket center and key subpocket centers as well as the moments of the distance distributions) account for the structural conformation. Across all fingerprints, we saw that the fingerprint captures the variability of the physicochemical properties (e.g., most residues are uncharged, whereas HBD/HBA features vary) and the conservation of residue positions (e.g., distances to the DFG region are more widely spread than distances to the hinge region).
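As an illustration of the spatial bits, the sketch below computes the distances of residue positions to a (sub)pocket center and summarizes the resulting distribution by three moments. The choice of moments (mean, standard deviation, and signed cube root of the third central moment), the function names, and the toy coordinates are assumptions made for this example; they are not taken from the KiSSim package.

```python
import math


def distances_to_center(residue_coords, center):
    """Euclidean distance of each residue position (x, y, z) to a center."""
    return [math.dist(coord, center) for coord in residue_coords]


def distance_moments(distances):
    """Summarize a distance distribution by three moments (assumed choice):
    mean, standard deviation, and signed cube root of the third central moment.
    """
    n = len(distances)
    if n == 0:
        raise ValueError("at least one distance is required")
    mean = sum(distances) / n
    m2 = sum((d - mean) ** 2 for d in distances) / n
    m3 = sum((d - mean) ** 3 for d in distances) / n
    std = math.sqrt(m2)
    # The cube root keeps the third moment in the same unit as the distances.
    skew = math.copysign(abs(m3) ** (1.0 / 3.0), m3)
    return mean, std, skew


# Toy coordinates for three hypothetical residues and a hypothetical center.
toy_coords = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
toy_center = (0.0, 0.0, 0.0)
toy_moments = distance_moments(distances_to_center(toy_coords, toy_center))
```

Because the 85 pocket residues are pre-aligned, such per-center distances and moments can be compared position by position between two structures without an additional structural alignment.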
We used the fingerprint to calculate all-against-all distances (small distances refer to high similarity, large distances to low similarity) within the structurally covered kinome: the DFG-in and DFG-out datasets consist of 4112 and 406 structures, representing 257 and 71 kinases, respectively. We found that the fingerprint can distinguish between intra- and inter-kinase similarities and between DFG-in and DFG-out structures. Some kinases are represented by multiple structures; hence, some kinase pairs are represented by multiple structure pairs. The distribution of structure distances for one kinase pair can be broad; for each kinase pair, we selected the closest experimentally observed structure pair. We clustered the resulting kinase distance matrix to produce a KiSSim-based kinome tree. While the tree reproduced large parts of the sequence-based Manning tree, we observed some relationships that are unexpected from a sequence perspective alone. For example, we found similarities between CaMKK2 (STE) and DRAK2 (CAMK), which are targeted by the same chemical probe SGC-STK17B-1; 9 we could also confirm the reassignment of AurA, AurC, PLK4, and CaMKK2 from the Other to the CAMK group as proposed by Modi and Dunbrack. 7
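The reduction from structure pairs to kinase pairs described above can be sketched as follows: for each kinase pair, the smallest distance over all of its structure pairs is kept. The data layout (a dictionary keyed by structure pairs), the function name, and the placeholder identifiers are assumptions for illustration only.

```python
def kinase_distances(structure_distances, structure_to_kinase):
    """Keep, per kinase pair, the minimum distance over all structure pairs.

    structure_distances: dict mapping (structure_a, structure_b) -> distance
    structure_to_kinase: dict mapping structure ID -> kinase name
    """
    best = {}
    for (s1, s2), distance in structure_distances.items():
        k1, k2 = structure_to_kinase[s1], structure_to_kinase[s2]
        if k1 == k2:
            continue  # intra-kinase structure pairs are not part of the kinase matrix
        pair = tuple(sorted((k1, k2)))
        if pair not in best or distance < best[pair]:
            best[pair] = distance
    return best


# Toy data with placeholder structure IDs and kinase names (hypothetical values).
toy_structure_to_kinase = {"s1": "KinA", "s2": "KinA", "s3": "KinB", "s4": "KinC"}
toy_structure_distances = {
    ("s1", "s2"): 0.01,
    ("s1", "s3"): 0.30,
    ("s2", "s3"): 0.12,
    ("s1", "s4"): 0.40,
    ("s2", "s4"): 0.45,
    ("s3", "s4"): 0.25,
}
toy_kinase_distances = kinase_distances(toy_structure_distances, toy_structure_to_kinase)
```

The resulting kinase-level distances could then be converted into a condensed distance matrix and passed to a hierarchical clustering routine to obtain a tree; the clustering settings are left open here.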
Besides the averaged tree view, we also investigated the top-ranked kinases given a query kinase to show that KiSSim can partially explain profiling data. While some ligand profiles are reflected completely in the KiSSim dataset (e.g., Imatinib), other ligand profiles are covered partially (e.g., Erlotinib's off-targets LOK and SLK are detected while GAK is not).
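A query-based ranking of this kind can be sketched in a few lines, assuming kinase distances are stored per unordered kinase pair. The function name `top_ranked`, the parameter `k`, and the placeholder kinase names are hypothetical and serve only to illustrate the lookup.

```python
def top_ranked(kinase_distances, query, k=3):
    """Return the k kinases with the smallest distance to the query kinase.

    kinase_distances: dict mapping (kinase_a, kinase_b) -> distance
    """
    neighbors = []
    for (k1, k2), distance in kinase_distances.items():
        if k1 == query:
            neighbors.append((distance, k2))
        elif k2 == query:
            neighbors.append((distance, k1))
    neighbors.sort()  # ascending: small distance means high similarity
    return [name for _, name in neighbors[:k]]


# Toy kinase distances with placeholder names (hypothetical values).
toy_distances = {
    ("KinA", "KinB"): 0.12,
    ("KinA", "KinC"): 0.40,
    ("KinB", "KinC"): 0.25,
}
closest_to_a = top_ranked(toy_distances, "KinA", k=2)
closest_to_b = top_ranked(toy_distances, "KinB", k=2)
```

Comparing such a ranked list with the kinases on which a ligand is reported active shows which parts of a ligand profile are recovered by the similarity measure.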
In comparison with other similarity measures that focus on the pocket sequence (KLIFS seq), interaction profiles (KLIFS IFP), or topological and chemical pocket properties (SiteAlign), KiSSim performs equally well or slightly better in most cases. The sequence- and IFP-based measures are easy and fast to compute thanks to the preprocessed kinase pockets available at KLIFS; we recommend including these datasets in any case when investigating kinase similarities. SiteAlign is a powerful tool to compare pockets across all protein classes; if only kinases are of interest, KiSSim is a kinase-focused and faster alternative with slightly better results in most of the investigated cases.
As for all structure-based methods, the imbalanced dataset of kinase structures is a challenge. Some kinases are structurally well represented (e.g., EGFR or CDK2), while others have only a few structures available. Unfortunately, roughly half of the human kinome still has no structural information available at all. The recent breakthrough of AlphaFold2 75 could help here; predicted structures for almost all human kinases are now available in the AlphaFold DB. 76 Modi and Dunbrack 77 have already classified the structures' conformations and found most structures in the DFG-in conformation. An AlphaFold-enhanced KiSSim tree may further increase the usefulness of the KiSSim methodology for kinome-wide similarity studies. Furthermore, the KiSSim fingerprint can be applied in machine learning, e.g., to extract the most important features in the kinase pocket.
We believe that the KiSSim fingerprint is a valuable tool for kinase research to explain and predict off-targets and polypharmacology. Since the code is open source and available as a Python package, the KiSSim fingerprint can easily be integrated into larger-scale workflows.
Code and data availability