KinFragLib: Exploring the Kinase Inhibitor Space Using Subpocket-Focused Fragmentation and Recombination

Protein kinases play a crucial role in many cell signaling processes, making them one of the most important families of drug targets. In this context, fragment-based drug design strategies have been successfully applied to develop novel kinase inhibitors, usually following a knowledge-driven approach to optimize a focused set of fragments into a potent kinase inhibitor. As an alternative, KinFragLib is a new method for exploring and extending the chemical space of kinase inhibitors using data-driven fragmentation and recombination, built on available structural kinome data from the KLIFS database for over 2,500 kinase DFG-in complexes. The computational fragmentation method splits the co-crystallized non-covalent kinase inhibitors into fragments with respect to their 3D proximity to six predefined functionally relevant subpocket centers. The resulting fragment library consists of six subpocket pools with over 7,000 fragments and supports two main applications: (i) in-depth analyses of the chemical space of known kinase inhibitors, subpocket characteristics and connections, and (ii) subpocket-informed recombination of fragments to generate potential novel inhibitors. The latter showed that recombining only a subset of 624 representative fragments generated a combinatorial library of 6.7 million molecules, containing, besides some known kinase inhibitors, more than 99% novel chemical matter compared to ChEMBL and 63% molecules compliant with Lipinski’s rule of five.

which share a similar sequence and structure, and atypical protein kinases, which have biochemical kinase activity, but lack sequence similarity to the typical kinase domain. Furthermore, eukaryotic protein kinases can be classified based on their sequence identity into 8 main kinase groups: AGC, CAMK, CK1, CMGC, STE, TK, TKL, and Other. 1,6
Kinase structure. Protein kinase structures consist of two domains, i.e. the N- and C-lobes, connected via a hinge region. The majority of kinase inhibitors target the catalytic cleft between these lobes, which contains the highly conserved ATP binding site. Based on over 1,200 kinase-ligand crystal structures, van Linden et al. 7 have defined the binding site to comprise 85 residues and 19 well-defined regions/motifs, covering a (i) front cleft, (ii) gate area, and (iii) back cleft (important regions shown in Figure 1.B and their KLIFS 7,8 numbering in brackets in the following): (i) ATP solely occupies the front cleft, which contains the hinge region (46-48), linker (49-52), glycine-rich loop (4-9), and catalytic loop (68-75). ATP's adenine group as well as most kinase inhibitors form hydrogen bonds to the hinge region. (ii) The gate area contains the conserved DFG motif (81-83), the conserved lysine residue K17 (17), and the gatekeeper residue (45), which precedes the hinge region and is often exploited for inhibitor selectivity. (iii) The back cleft contains, amongst others, the αC-helix (20-30), including the conserved glutamic acid residue E24 (24), which forms the conserved K17-E24 salt bridge in the so-called αC-in (as opposed to the αC-out) conformations. Furthermore, the DFG motif can undergo a significant conformational change, which results in an inactive state of the kinase (DFG-out instead of DFG-in conformation). This DFG-flip opens a hydrophobic region in the back cleft targeted by inhibitors stabilizing the inactive state. 7,9 The KLIFS database 7,8 has made this and further information about the available kinase structures, their bound ligands, and the interactions between them freely available.
Kinase inhibitors. Kinase inhibitors are classified by their binding modes. 10 Type I and II inhibitors occupy mainly the front cleft and form hydrogen bonds with the hinge region: Type I and I1/2 inhibitors bind to the active and inactive DFG-in conformation, respectively, whereas type II inhibitors stabilize the inactive DFG-out conformation. Allosteric inhibitors bind only next to the ATP binding site (type III) or outside of the catalytic cleft (type IV). Type V inhibitors are bivalent, i.e. binding different regions simultaneously. While type I-V inhibitors bind reversibly, covalent inhibitors are classified as type VI.

Fragmentation and recombination of kinase inhibitors
Fragment-based drug discovery (FBDD) has been successfully applied to develop novel and selective compounds, including kinase inhibitors. 11,12 Fragments are low molecular weight compounds targeting a specific subpocket within the active site of a protein. They usually bind to their target with weaker activity than traditional drug-like molecules but with a good binding efficiency, i.e. a higher proportion of the atoms is interacting with the protein. 13,14 In drug design, molecules can be viewed as combinations of multiple fragments. Linking, replacing, or recombining fragments is the essence of FBDD. Fragments can be generated computationally by decomposing larger compounds. Clearly, the choice of the fragmentation technique will have an impact on the resulting fragment library. For example, RECAP (REtrosynthetic Combinatorial Analysis Procedure) 15 and BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures) 16 aim to cut only synthetically meaningful chemical bonds. eMolFrag 17 builds on top of BRICS to generate a set of (larger) bricks and (smaller) connecting linkers.
Typically, FBDD starts with the screening of a fragment library to identify binders to specific targets, and only these hits are optimized into larger compounds by fragment linking or fragment growing. The screening step can be done experimentally or in silico. 18 In the context of kinase inhibition, Urich et al. 19 extracted ∼6,000 fragments with hinge-binding motifs from a kinase-unfocused library of 2.3 million compounds and docked them against 46 kinase structures to identify potential hinge binders. Fragment expansion of promising hits yielded a number of potent kinase inhibitors. Rachman et al. 20 reported a potent hinge binding fragment, selected from a kinase-unfocused fragment library (624 fragments) via docking against the JAK2 ATP binding site, filtered by (i) pharmacophoric restraints (restrained docking) at the hinge region and (ii) interaction strength measured by the work necessary to break a defined hinge hydrogen bond (dynamic undocking). However, it is also possible to start off directly with a kinase-focused library of fragments that provide optimal interaction patterns with the ATP binding site. For instance, Mukherjee et al. 21 report the Kinase Crystal Miner to extract the smallest possible fragment in each kinase-ligand crystal structure with hydrogen bonds to the hinge region, yielding about 1,000 fragments from 2,250 ligands. Substructure searches for these fragments in large molecule databases supplied molecules with kinase binding potential. Note that all the aforementioned approaches make use of 3D structural information and focus on hinge-binding fragments to be used for fragment expansion or substructure searches in compound libraries.
An alternative approach is to decompose a compound library based on kinase-focused criteria and recombine the resulting fragments into a kinase-focused molecule library. Recently, Yang et al. 22 reported a ligand-based fragmentation and recombination strategy, which was applied to both a kinase-focused (194 kinase inhibitors from PKIDB 23 ) and a kinase-unfocused library of ∼4.6 million compounds. The fragments were assigned to three different fragment pools representing three designated parts of a kinase inhibitor, i.e. the core, connecting, and modifying fragments. Without using 3D structural information, fragments were assigned to the core fragment pool if a donor-acceptor hinge recognition pattern could be found. Enumerating different combinations of core-connector-modifying fragments yielded two virtual kinase-focused recombined molecule libraries (∼500,000 and ∼40 million recombined molecules), based on the aforementioned kinase-focused and kinase-unfocused input data.

KinFragLib methodology
KinFragLib, which is introduced here, takes advantage of the large amount of structural data on kinase ligands from KLIFS for subpocket-based fragmentation and recombination (Figure 1). Organizing fragments from kinase ligands by subpocket makes it possible not only to perform a detailed subpocket-specific analysis of their fragment space, but also to better understand the composition and spatial arrangement of reported kinase-ligand complexes. Moreover, this kinase-focused fragment library organized by subpocket allows for a specific and controlled fragment recombination, opening up previously unexplored regions of the chemical space of kinase inhibitors.

Data and Methods
The following sections describe the procedure for (1) collecting and preprocessing the dataset of kinase complex structures, (2) defining subpockets, (3) fragmenting each of the co-crystallized ligands in the dataset, (4) analyzing the fragment library, (5) recombining fragments, and (6) studying the combinatorial library.

(1) Data collection and preprocessing
Structures of kinase-inhibitor complexes were collected from the KLIFS database 7,8 (downloaded on 2019-11-06), which offers superimposed kinase structures from the PDB 24 with 85 residues defined as kinase binding site. In KLIFS, several entries can exist for one PDB code, since crystal structures were split into all existing alternate location models and all kinase-domain-containing chains of heteromeric protein complexes. Each KLIFS entry comes with details on the species, kinase, kinase group, PDB code of the complex and the ligand, sequence alignment of the 85 binding site residues, DFG conformation (in, out, or out-like), ligand position (within or outside the main pocket), and KLIFS quality score. The latter ranges from 0 (bad) to 10 (flawless) and describes the quality of both the alignment and the structure, based on each structure's alignment to a reference and on its number of missing residues and atoms, respectively.
The structural data is preprocessed as summarized in Table 1.
Figure 1. (A) Based on their location in the binding site, known kinase ligands are fragmented and placed into subpocket pools, which can then be used to generate a combinatorial library. (B) The kinase binding site is shown with important regions and the six defined subpocket centers as spheres (PDB:3W2S (EGFR)). (C) Schematic depiction of the six subpockets and the predefined allowed connections between these subpockets. Colors of the subpockets are matching in B and C.
Finally, as the current approach focuses on the discovery of reversible inhibitors, covalent ligands were also excluded. These were identified by downloading the PDB file corresponding to the KLIFS structure and checking the CONECT records for any connection between the kinase and the ligand. Note that after personal communication with A. Kooistra, 27 two PDB entries (2clx, 4cfn) were manually excluded from this set of covalent complexes, since the ligand was found not to be covalently bound; and three PDB entries (4d9t, 4hct, 4kio) were added, because the ligands bind covalently but the CONECT entries were missing (see full list of removed structures with covalent ligands in the SI). The dataset after preprocessing consists of 2,801 kinase-ligand structures. Further filtering steps during the fragmentation procedure as described in "(3) Molecule fragmentation" result in a final dataset of 2,553 complex structures (see Table 1).
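The CONECT-based check for covalent ligands can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `has_covalent_link` and the identification of the ligand by its residue name are hypothetical choices, and the code relies only on the fixed-column layout of ATOM, HETATM, and CONECT records in the PDB format.

```python
def has_covalent_link(pdb_text, ligand_resname):
    """Return True if any CONECT record links a protein atom to a ligand atom.

    Assumption: protein atoms are ATOM records, ligand atoms are HETATM
    records whose residue name equals `ligand_resname`.
    """
    protein_serials, ligand_serials = set(), set()
    lines = pdb_text.splitlines()
    for line in lines:
        record = line[:6].strip()
        if record == "ATOM":
            protein_serials.add(int(line[6:11]))
        elif record == "HETATM" and line[17:20].strip() == ligand_resname:
            ligand_serials.add(int(line[6:11]))

    for line in lines:
        if not line.startswith("CONECT"):
            continue
        # Columns 7-11 hold the source atom, columns 12-31 up to four partners.
        fields = [line[i:i + 5] for i in range(6, min(len(line), 31), 5)]
        serials = [int(f) for f in fields if f.strip()]
        if len(serials) < 2:
            continue
        source, partners = serials[0], serials[1:]
        for partner in partners:
            pair = {source, partner}
            if pair & ligand_serials and pair & protein_serials:
                return True
    return False
```

As noted above, such a check misses covalent ligands whose CONECT entries are absent from the PDB file, which is why a few entries had to be curated manually.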

(2) Subpocket definition and allowed connections
In this work, the kinase binding site was divided into six subpockets, which were selected based on their location and function in known kinase-inhibitor structures. Each subpocket is described by the geometric center of the Cα atoms of newly identified anchor residues from the 85 binding site residues defined by KLIFS. 7 The respective subpocket-spanning anchor residues (Table 2) were selected manually after visual inspection of several structures, with the aim of defining locations that overlay with important parts of known kinase ligands and of providing a good distribution of centers within the pocket. As one example, the subpocket centers within the binding site of the epidermal growth factor receptor (EGFR) kinase are shown in Figure 1.B. Later, fragments will be assigned to the closest subpocket, by measuring their distance to the subpocket centers, and stored in subpocket-specific library pools (subpocket pools). In the following, the residue numbering refers to the numbering used in KLIFS.
Subpocket locations. The adenine pocket (AP), located at the geometric center of the spanning residues 15, 46, 51, and 75, lies next to the hinge region. It is usually occupied by adenine in the ATP-bound state of a kinase and can anchor substrates or other compounds via up to three hydrogen bonds. The solvent-exposed pocket (SE), defined here by the single residue 51, at the entrance of the binding site adjacent to AP was also called the selectivity entrance by Zhao et al., 28 as it shows diverse characteristics in different kinases and can therefore be used to achieve improved selectivity. The front pocket (FP), here represented by the geometric center of residues 10, 51, 72, and 81, is occupied by the ribose and phosphate groups of ATP and is partially solvent-exposed. 9 The gate area (GA) acts as a gate between the front cleft (containing AP, FP, and SE) and the back cleft. The GA pocket is defined by the region between the gatekeeper (residue 45), the conserved lysine (residue 17) and the aspartic acid (residue 81) in the DFG motif. The back cleft was split into two subpockets, back pocket I and II (B1 and B2), both lying next to the αC-helix, spanned by residues 28, 38, 43 and 81 as well as 18, 24, 79, and 83, respectively. In addition to the six subpocket pools, a seventh pool X was created to hold fragments that cannot be assigned clearly to a subpocket because the distance to their closest subpocket center exceeds 8 Å.
Exceptions for anchor residue definition. The definition of the 85 binding site residues in the KLIFS database is based on a multiple sequence alignment, which can have gaps. Residues with a high gap rate among the structures were therefore not chosen as anchor residues. Furthermore, some coordinates of an amino acid or a single atom may be missing because they could not be resolved by crystallography.
If the coordinates of an anchor residue's Cα atom were missing, the following procedure was applied: If possible, the coordinates were replaced with the geometric center of the two neighboring residues' Cα atoms. If one of those was absent as well, the coordinates of the other neighboring residue were used instead. If both adjacent Cα atoms were missing, the structure was discarded (see Table 1 (B.1)).
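The subpocket center calculation, including the fallback for missing Cα atoms, can be sketched as follows. The anchor residues are those listed above; the function names and the input format (a dictionary mapping KLIFS residue numbers to Cα coordinates) are assumptions made for this illustration.

```python
# Anchor residues (KLIFS numbering) as given in the text.
ANCHORS = {
    "AP": (15, 46, 51, 75),
    "SE": (51,),
    "FP": (10, 51, 72, 81),
    "GA": (17, 45, 81),
    "B1": (28, 38, 43, 81),
    "B2": (18, 24, 79, 83),
}


def centroid(points):
    """Geometric center of a non-empty list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def anchor_position(ca_coords, residue):
    """C-alpha position of an anchor residue, or a substitute from its neighbors.

    Returns None if the residue and both sequence neighbors are missing.
    """
    if residue in ca_coords:
        return ca_coords[residue]
    neighbors = [ca_coords[r] for r in (residue - 1, residue + 1) if r in ca_coords]
    # Two neighbors: their geometric center; one neighbor: that neighbor itself.
    return centroid(neighbors) if neighbors else None


def subpocket_centers(ca_coords):
    """Centers of all six subpockets, or None if the structure must be discarded."""
    centers = {}
    for name, residues in ANCHORS.items():
        positions = [anchor_position(ca_coords, r) for r in residues]
        if any(p is None for p in positions):
            return None
        centers[name] = centroid(positions)
    return centers
```

Returning `None` for the whole structure mirrors the rule that a complex is discarded as soon as one anchor position cannot be recovered.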
Allowed subpocket connections. In order to set up the fragment library, first, the connections between the above defined subpockets were investigated. After manual inspection of the typical structure of known kinase inhibitors (type I and I1/2 only), eight allowed subpocket connections were identified as schematically depicted in Figure 1.C. A first investigation of the generated fragments revealed that 95.2% of the molecules comply with this scheme. The remaining 4.8% of ligands, which exhibit unexpected subpocket connections, were manually inspected: For some cases, special rules could be applied to direct fragmentation towards the defined subpocket connections, others were discarded during this analysis step (see "(3) Molecule fragmentation" and Table 1 (B.4)).

(3) Molecule fragmentation
A fragmentation algorithm was implemented to generate fragments from a given ligand in complex with a kinase structure, assign them to subpockets, and thereby populate the fragment library's subpocket pools (see Figure 1). The fragmentation algorithm is schematically depicted in Figure 2. Each kinase-ligand complex is processed successively in the following way (steps (3.1)-(3.4)):
To determine the potential cleaving positions, the co-crystallized ligand of the structure in hand is submitted to an initial fragmentation step, applying the RDKit implementation of the BRICS algorithm. Next, each of the resulting fragments needs to be assigned to a subpocket. To this end, the geometric center of all atoms (including hydrogens) in the fragment is calculated, together with its distance to all subpocket centers. Then, the fragment is assigned to the subpocket with the closest subpocket center. However, if the closest subpocket center is more than 8 Å away from the fragment, this fragment is considered as lying outside of the binding site and assigned to the outlier pool X. Note that the information on the BRICS environment type of each fragment is kept for later recombination.
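The distance-based assignment of a BRICS fragment can be sketched as follows. This is a minimal illustration: the function name `assign_subpocket` and the plain-tuple coordinate format are assumptions, while the 8 Å cutoff and the pool name X are taken from the text.

```python
import math


def assign_subpocket(fragment_atoms, centers, cutoff=8.0):
    """Assign a fragment to the subpocket with the closest center.

    fragment_atoms: list of (x, y, z) tuples for all atoms of the fragment.
    centers: dict mapping subpocket name to its (x, y, z) center.
    Returns (pool_name, distance); the pool is "X" if the closest center
    is farther away than `cutoff` (in angstroms).
    """
    n = len(fragment_atoms)
    center = tuple(sum(atom[i] for atom in fragment_atoms) / n for i in range(3))
    name, distance = min(
        ((pocket, math.dist(center, xyz)) for pocket, xyz in centers.items()),
        key=lambda item: item[1],
    )
    return ("X" if distance > cutoff else name), distance
```

For example, with centers `{"AP": (0, 0, 0), "FP": (10, 0, 0)}`, a fragment with atoms at `(1, 0, 0)` and `(3, 0, 0)` has its geometric center 2 Å from AP and is assigned to AP.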
Subsequently, the cleavage assignments are revised in order to avoid overly small fragments in the final fragment library. For each fragment with fewer than three atoms, the neighboring fragments are checked. If all neighboring fragments are assigned to the same subpocket, nothing needs to be done, because by default they will be merged in the next step. If the subpockets of the neighboring fragments differ, the current small fragment is assigned to the subpocket of the largest neighboring fragment. This procedure is repeated until all fragments with fewer than three atoms are reassigned.
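The revision of small fragments can be sketched as a single pass over the BRICS fragments of one ligand. The data layout (dictionaries with hypothetical `n_atoms` and `subpocket` keys plus an adjacency map) and the function name are assumptions for this illustration; the rule itself follows the description above.

```python
def revise_small_fragments(fragments, neighbors, min_atoms=3):
    """Reassign fragments with fewer than `min_atoms` atoms.

    fragments: dict id -> {"n_atoms": int, "subpocket": str}
    neighbors: dict id -> list of ids of directly bonded fragments
    Returns a revised copy; the input is left unchanged.
    """
    revised = {frag_id: dict(frag) for frag_id, frag in fragments.items()}
    for frag_id, frag in revised.items():
        if frag["n_atoms"] >= min_atoms:
            continue
        adjacent = neighbors.get(frag_id, [])
        pockets = {revised[n]["subpocket"] for n in adjacent}
        if len(pockets) <= 1:
            # All neighbors share one subpocket: left as is, per the text.
            continue
        largest = max(adjacent, key=lambda n: revised[n]["n_atoms"])
        frag["subpocket"] = revised[largest]["subpocket"]
    return revised
```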
Finally, for each bond between two BRICS fragments, the subpockets of the two fragments are compared. If the two subpockets differ, this bond is stored as a cleaving position for the final fragmentation. (i) If a connection between subpockets FP and B1 or FP and B2 is detected, the distance of the FP fragment to the GA subpocket center is calculated. If this distance is smaller than 5 Å, this fragment is reassigned to GA instead (applied to only 15 cases). Otherwise, the fragment in B1 or B2, respectively, is assigned to pool X. (ii) If any unwanted subpocket connection is still present after this procedure, the complete ligand is excluded from the fragment library (see Table 1 (B.4) and "Results and Discussion: Subpocket connections" for more detail).
Summary of removed ligands during fragmentation. During the fragmentation procedure, some complexes were discarded for the reasons listed in Table 1.
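The selection of final cleaving positions and the special rule for FP-B1 and FP-B2 connections can be sketched as follows. Function names and data layout are assumptions for this illustration; the 5 Å threshold and the reassignment targets (GA or pool X) are taken from the text.

```python
import math


def cleavage_bonds(brics_bonds, subpocket_of):
    """Keep only BRICS bonds whose two fragments lie in different subpockets.

    brics_bonds: list of (fragment_id_a, fragment_id_b) pairs.
    subpocket_of: dict fragment_id -> subpocket name.
    """
    return [(a, b) for a, b in brics_bonds if subpocket_of[a] != subpocket_of[b]]


def resolve_fp_back_pocket(fp_centroid, ga_center, back_pocket, threshold=5.0):
    """Resolve an FP-B1 or FP-B2 connection.

    Returns (original_subpocket, new_pool) for the fragment that is reassigned:
    the FP fragment moves to GA if it lies within `threshold` angstroms of
    the GA center, otherwise the back pocket fragment moves to pool X.
    """
    if math.dist(fp_centroid, ga_center) < threshold:
        return ("FP", "GA")
    return (back_pocket, "X")
```

Bonds between fragments in the same subpocket are not cleaved, so such fragments remain merged in the final fragmentation.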

(4) Fragment analysis
The following paragraphs describe the different analyses that were performed on the fragment level.
Deduplicated fragments. Several fragments were contained more than once in a subpocket; therefore, a unified set was created for further analysis. First, fragments were simplified by replacing dummy atoms with hydrogens and removing all non-explicit hydrogens (simplified fragments). Second, fragments within one subpocket pool were deduplicated based on their canonical SMILES representation, i.e. in case of identical fragments only one was kept (deduplicated fragments).
Fragment similarity was calculated to analyze the fragment diversity within subpockets as well as within and across kinase groups.
For the subpocket-based analysis, fragments were deduplicated per subpocket and similarities between all pairwise fragment combinations per subpocket were calculated. To this end, the topological RDKit molecular fingerprint 29 was generated for each fragment and the Tanimoto similarity metric was applied. Self-comparisons of fragments were omitted.
To analyze similarities within and across kinase groups, fragments were categorized by subpocket and kinase group (according to the structure they were bound to) and deduplicated per category. For each subpocket (excluding pool X), similarities between all pairwise fragment combinations within and across all kinase groups were calculated as described in the previous paragraph.
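The pairwise similarity calculation used in both analyses can be sketched as follows. As a simplifying assumption, each fingerprint is represented here as a Python set of on-bit indices that has been computed beforehand (for instance with RDKit); the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Similarities for all unordered pairs; self-comparisons are omitted."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
```

For example, two fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits and therefore have a Tanimoto similarity of 0.5.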
Common fragment motifs per subpocket. In order to identify the most common fragments in each subpocket (excluding pool X), the number of occurrences of each fragment was calculated before deduplication based on the simplified fragments. The 50 most common fragments in each subpocket were then clustered based on the Butina algorithm 30 using topological RDKit molecular fingerprints 29 and a distance threshold of 0.6. Note that subpockets B1 and B2 contain fewer than 50 deduplicated fragments and thus all fragments were chosen for clustering.
Furthermore, representative fragments were extracted manually for each subpocket in order to provide a visual overview of chemical differences and overlaps between subpockets. Each selected fragment represents a variety of common fragments with similar scaffolds and R-groups.
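The clustering step used for the common fragments can be illustrated with a compact pure-Python variant of Butina clustering. This is a sketch rather than the RDKit implementation referenced above: the function name `butina_clusters` is hypothetical, the distance is supplied as a callable (for fingerprints, one minus the Tanimoto similarity), and neighbor counts are computed once without re-sorting after each cluster is formed.

```python
def butina_clusters(n_items, distance, cutoff=0.6):
    """Cluster items 0..n_items-1 following the Butina scheme.

    distance: callable (i, j) -> float.
    Returns a list of clusters; the first entry of each cluster is its centroid.
    """
    # Neighbor lists: all items within the distance cutoff.
    neighbors = [set() for _ in range(n_items)]
    for i in range(n_items):
        for j in range(i):
            if distance(i, j) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Items with the most neighbors become cluster centroids first.
    order = sorted(range(n_items), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters
```

Each item ends up in exactly one cluster, and items without any neighbor within the cutoff form singleton clusters.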

(5) Fragment recombination
Novel molecules can be created by recombining fragments from the fragment library. For a proof-of-concept study, only a subset of the fragment library was used. The individual steps for data reduction and fragment recombination are explained in this section.
Data reduction. The full fragment library contains 7,486 fragments. In order to reduce the combinatorial library size and run time, a diverse subset of fragments was chosen. (i) All fragments that are not suitable for recombination were removed, i.e. duplicates, fragments in pool X, fragments without dummy atoms (unfragmented ligands), and fragments with dummy atoms only connecting to pool X. Furthermore, only fragments complying with the rule of three, 31 a filter for fragment-likeness, and hinge-like AP fragments were kept. The latter filter checks for at least one hydrogen bond donor or acceptor in the AP fragment, together with at least one aliphatic or aromatic ring. The filtering steps in (i) result in 2,029 fragments. (ii) Per subpocket, a diverse set of fragments was selected for recombination to avoid enumerating highly similar fragments. The Butina algorithm 30 was applied to cluster each subpocket's filtered fragments using topological RDKit molecular fingerprints 29
Recombination procedure. All possible fragment combinations of the above described reduced set were enumerated, while preserving the original subpocket connections when connecting the fragmented bonds using the subpocket-labeled dummy atoms. Recombination started from AP fragments only, while fragments from other subpockets were consecutively added, thereby excluding any recombined molecules not occupying AP. Fragments were combined by adding a bond between two atoms adjacent to dummy atoms, while removing the dummy atoms. Two fragments were connected via such a new bond if the following conditions were fulfilled: (i) The first fragment's dummy atom was associated with the same subpocket as the second fragment and vice versa. (ii) The BRICS environment types of the atoms to be connected were matching according to the BRICS rules, 16 in order to preserve synthetic accessibility. The bond type (single or double bond) between dummy atoms was preserved when connecting the fragments. (iii) While connecting the fragments, it was ensured that the resulting molecule did not contain two fragments from the same subpocket, i.e. that no subpocket was occupied more than once.
Recombination was deemed complete if either the molecule had no dummy atoms left to another subpocket (excluding pool X), the molecule's remaining dummy atoms could not be replaced by any matching fragment, or the molecule consisted of 4 fragments. This upper limit of occupied subpockets was introduced, since the majority of kinase ligands occupies only up to 4 subpockets (see Figure 3.A) and molecules occupying more subpockets will mostly not fulfill the requirements of a drug-like molecule due to their size (e.g. Lipinski's rule of five 32 ). Finally, if the resulting recombined molecule contained any remaining dummy atoms, they were replaced with hydrogen atoms. This recombination strategy produced over 6.7 million ligands based on 624 fragments.
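The conditions for connecting two fragments during recombination can be sketched as a single predicate. All names and the data layout are assumptions for this illustration: an attachment point is modeled as a dictionary with hypothetical keys `target` (subpocket of the removed neighbor), `env` (BRICS environment label), and `bond` (bond type). The set of compatible environment pairs must be supplied by the caller; the pairs shown in the usage example are placeholders, not the actual BRICS rules.

```python
MAX_FRAGMENTS = 4  # upper limit of occupied subpockets, as stated in the text


def can_connect(occupied, port_a, subpocket_a, port_b, subpocket_b, allowed_env_pairs):
    """Check whether a new fragment may be attached to a growing molecule.

    occupied: set of subpockets already occupied by the growing molecule.
    port_a / subpocket_a: attachment point and subpocket of the fragment
        already in the molecule.
    port_b / subpocket_b: attachment point and subpocket of the new fragment.
    allowed_env_pairs: set of compatible (env, env) label pairs.
    """
    if len(occupied) >= MAX_FRAGMENTS:
        return False
    # (iii) each subpocket may be occupied only once
    if subpocket_b in occupied:
        return False
    # (i) both dummy atoms must point to the partner's subpocket
    if port_a["target"] != subpocket_b or port_b["target"] != subpocket_a:
        return False
    # bond type (single or double) must be preserved
    if port_a["bond"] != port_b["bond"]:
        return False
    # (ii) environment types must be compatible
    pair = (port_a["env"], port_b["env"])
    return pair in allowed_env_pairs or pair[::-1] in allowed_env_pairs
```

A usage example with placeholder environment labels:

```python
rules = {("envA", "envB")}  # placeholder pairs, not real BRICS rules
ap_port = {"target": "FP", "env": "envA", "bond": "single"}
fp_port = {"target": "AP", "env": "envB", "bond": "single"}
can_connect({"AP"}, ap_port, "AP", fp_port, "FP", rules)  # True
```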

Results and Discussion
The main objective of this work has been to decompose kinase ligands with respect to 3D information and to assign each resulting fragment to the kinase subpocket it binds to.
Only kinase-ligand complex structures with molecules targeting the ATP binding site in the DFG-in conformation were selected, such as type I and I1/2 inhibitors, to reduce the conformational space of the kinase structures. After filtering the 7,370 starting structures assembled from the KLIFS database, 2,553 protein kinase-ligand structures were chosen for this study.
In a first step, inspired by the functional subpocket annotation in KLIFS, six functionally relevant subpockets were defined covering the ATP binding site. Note that KLIFS specifies eight subpockets, some of which describe relatively small subpockets that were combined into one subpocket in KinFragLib. Subpockets that are too small are algorithmically undesirable in this case, because either very small fragments would be generated or larger fragments would span several of these small subpockets. Additionally, a solvent-exposed pocket (SE) was introduced in KinFragLib, a region of the binding site occupied by many kinase inhibitors (see subpocket definitions in Table 2).
In a second step, the co-crystallized kinase ligands were fragmented with respect to the subpockets that they occupy, resulting in a kinase-focused fragment library with six subpocket pools (plus the pool X) and 7,486 fragments. An in-depth analysis of the six subpocket pools enabled novel insights into subpocket structural trends, chemical diversity, and typical kinase-binding motifs.
In the last step, a subset of this kinase-focused fragment library was used to create a

Subpockets and fragment library
The generated kinase-focused fragment library makes it possible to analyze kinase-ligand interactions and explore the chemical space of kinase ligands. In total, 7,486 fragments (7,201 fragments without pool X) originating from 2,553 co-crystallized ligands were generated by the fragmentation procedure. After subpocket-based deduplication, 2,977 fragments remain (without pool X). In the following, this fragment library is analyzed with respect to the following aspects: ligand occupancy and connectivity across subpockets, fragment occurrence, properties, and similarity per subpocket, fragment promiscuity, as well as common fragments and motifs per subpocket. This analysis aims to provide a better understanding of kinase-inhibitor binding and may serve as a valuable starting point for the design of novel kinase inhibitors.

Ligand occupancy across subpockets
The compiled fragment library enables an in-depth analysis of the number of subpockets occupied by the original ligands (Figure 3.A).
Ligands occupying 2-4 subpockets. The majority of ligands occupies two (28%) or three (53%) of the six subpockets. In another 13% of the cases, the ligand spans four subpockets (examples can be seen in Figure 4.A, B and E). This demonstrates that kinase ligands usually do not fully exploit the available space in the kinase binding site, but target only specific subpockets.
Ligands occupying 1 subpocket. Additionally, 127 ligands (5%) target only one subpocket and were left unfragmented during the fragmentation procedure. Since this study focuses on ligands covering the AP subpocket, all these unfragmented ligands are located in AP. They have an average number of 15 heavy atoms, which is higher than the average over all AP fragments (11 heavy atoms). As shown in Figure S1.B-D, these molecules represent either (i) small fragment-like molecules or (ii) large rigid molecules that contain a large fraction of rings, which are difficult to split for most fragmentation algorithms. An example for the former group (i) is the series of halogenated pyrazoles that stem from a fragment-based approach for druggability assessment and hit generation, 47 see Figure S1.B1-B8. The latter group (ii) contains complete drug-like molecules that either could not be divided because none of the BRICS rules applied or had a potential BRICS cleavage bond in the initial fragmentation step, which was not broken because the two potential fragments were located in the same subpocket. Furthermore, there are rigid molecules that only contain fused rings with small decorations and, thus, are not amenable to any fragmentation approach (such as quinalizarin, a CK2 inhibitor, and derivatives, see Figure S1.C1-C2). An example of a molecule that could not be fragmented by BRICS is the co-crystallized ligand HK4 (CHEMBL248396, 48 pIC50 = 8.3) bound to the CHK1 structure (PDB:4FST, 49 see Figure S1.D1). The two ring moieties clearly cover distinct subpockets (AP and GA), but could not be assigned to them since no BRICS rule exists for splitting next to a triple bond between two carbon atoms.
Note that the unfragmented ligands cannot be used in the recombination algorithm (because no attachment point resulting from the fragmentation could be assigned). This could be seen as a restriction in available chemical space of the current approach, since each fragment-like molecule can be seen as a potential starting point for fragment growing. Nevertheless, roughly 28% of the unfragmented ligands were found to be substructures of other original ligands. More than half of these unfragmented ligands are fragment-like (i.e. fulfill the rule of three 31 ). Thus, they are implicitly used in the introduced recombination approach. The remaining 72% of the unfragmented ligands are, however, not considered, a limitation which could be addressed by manually adding attachment points at relevant positions.

Ligand connectivity across subpockets
The fragmentation of existing kinase inhibitors yields an overview of how the fragments are arranged within the binding site and throughout the individual subpockets. This makes it possible to analyze via which subpockets the fragments are most frequently connected.
Disallowed subpocket connections/special cases. As described in "Data and Methods", a few design choices were made to only allow the subpocket connections as depicted in Figure 1.C, defined based on prior investigation of known kinase inhibitors. 95.2% of the analyzed molecules follow this scheme, whereas (i) another 4.5% of the molecules could be rescued by the defined rules and (ii) the remaining 0.3% were discarded in this analysis as discussed in the following.
(i) In 113 cases, FP-B2 connections were detected initially. Manual inspection revealed two different methodological drawbacks that could be resolved by the introduced rules: First, in some cases a fragment was assigned to FP because its centroid was slightly closer to FP than GA, although visual inspection showed that the fragment acts as a gate from the front to the back cleft, and should therefore belong to GA (14 cases). The molecules containing these fragments could thus be included by reassigning them to GA (see "(3) Molecule fragmentation"). Second, the FP-B2 connection was observed when the FP fragment was relatively large. While part of it pointed mostly into the solvent, another part was still close enough to B2 and, thus, was assigned to this subpocket. Furthermore, very rare cases were manually observed where the fragment actually covered B2. Since the latter two cases could not be distinguished algorithmically, and the FP-B2 connection is rather unexpected, these B2 fragments were reassigned to pool X (99 cases). The same applies to FP-B1 connections, where each of the two cases described above occurred once.
(ii) Connections between non-adjacent subpockets (e.g. SE-GA, AP-B1) usually occur when one of the two subpockets contains a large BRICS fragment (that cannot be further fragmented), which also spans the respective subpocket in between. This happened only rarely, i.e. for AP-B1 and AP-B2 connections in 4 and 3 cases, respectively. Note that potential SE-GA connections were not counted as these ligands do not contain an AP fragment and were excluded from the study beforehand.
Subpocket connections and fragment arrangements. The fragment connectivity of the co-crystallized ligands was analyzed to identify the typical layout of kinase inhibitors.
Examples of ligands representing different subpocket connections including their frequency are illustrated in Figure 4. The central connections starting from AP are observed most often.
The AP-FP connection is the most frequent one, present in 61.5% of the analyzed ligands, closely followed by the AP-SE connection with 58.8% and, at some distance, the AP-GA connection with 36.0% (see Figure 4.A). This agrees with the finding that subpocket pools AP, FP, SE, and GA contain the most fragments in descending order (Figure 3.B). FP-GA and FP-SE connections also occur in more than 7% of the ligands each (see Figure 4.B and C). Generally, the back pockets B1 and B2 are covered less often in the fragment set and they can only be reached through GA. Accordingly, the GA-B1 and GA-B2 connections appear in only 3.7% and 3.3% of the cases, respectively (see Figure 4.D). B1-B2 connections are present in only 10 ligands (0.4%, see Figure 4.E).
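Connection frequencies of this kind are per-ligand percentages, so they do not need to sum to 100%. The following minimal sketch shows one way to compute them, assuming each ligand is available as a list of subpocket pairs; the function name and input layout are hypothetical.

```python
from collections import Counter


def connection_frequencies(ligands):
    """Percentage of ligands that contain each subpocket connection.

    `ligands` maps a ligand ID to its list of (subpocket, subpocket) pairs.
    A connection is counted at most once per ligand, and pair order is
    ignored by sorting the two names (so GA-B1 is reported as "B1-GA").
    """
    counts = Counter()
    for connections in ligands.values():
        unique = {"-".join(sorted(pair)) for pair in connections}
        counts.update(unique)
    n_ligands = len(ligands)
    return {name: 100 * count / n_ligands for name, count in counts.items()}


toy_ligands = {
    "lig1": [("AP", "FP")],
    "lig2": [("FP", "AP"), ("AP", "SE")],
    "lig3": [("AP", "GA"), ("GA", "B1")],
    "lig4": [("AP", "FP"), ("AP", "FP")],
}
print(connection_frequencies(toy_ligands))
```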
These findings seem to be in good agreement with the inhibitor binding modes reported in KLIFS (Table 5 in the original publication, 7 see also

Fragment occurrence per subpocket
The number of fragments per subpocket is reported in Figure 3.B and
(Right) Ligand fragmentation with assigned subpockets and dummy atoms (grey).
FP) to gain potency, followed by their neighboring subpockets, such as GA targeted to gain selectivity. In this dataset, the remote back pocket is targeted less frequently for two reasons: First, 69% of the underlying kinase structures show the αC-in conformation, limiting the available space for ligands in B1 and B2. Second, 73% of the front cleft binders but only 25% of the back cleft binders target the αC-in conformation. Pool X contains 285 additional fragments, i.e. fragments that were classified as lying outside of the main binding site or as showing disallowed subpocket connections.

Fragment properties per subpocket
In the following, the fragment pools were analyzed with respect to duplicate fragments and physicochemical properties across subpockets.
Duplicates. On average, 59% of the fragments in each subpocket were present in more than one structure (referred to as duplicates). This can be explained by the traditional medicinal chemistry approach to study a wide range of decorating groups around a shared molecular scaffold and thereby explore structure-activity relationships. Such approaches can result in the crystallization of multiple analogs from the same series. However, this finding also highlights the limited chemical diversity of the known kinase inhibitor space (considering molecules with available crystal structures only). The highest relative number of duplicates was identified in GA (70%); for the other subpockets, the values do not differ much from the average (Figure 3.B). The higher share of duplicates in GA could be explained by the generally smaller fragment size in this subpocket (compared to AP, FP, and SE, see
(i) AP fragments generally have a higher number of HBD and HBA, as this part of the inhibitor usually forms hydrogen bonds to the hinge region and acts as an anchor to position the ligand. 9 (ii) The logP values vary widely in all subpocket pools. X, FP, and SE fragments have the lowest median logP, i.e. they tend to be more hydrophilic. For SE, this can be explained by the solvent exposure of this part of the kinase binding site. The same holds for FP, which is also partially solvent-exposed. 9 While the AP fragments usually provide the anchoring hydrogen bonds, they are often surrounded by a hydrophobic pocket, which could explain the comparatively high logP of these fragments. (iii) AP, FP, and SE fragments tend to be larger in terms of the number of heavy atoms, with AP having the highest median value.
Note that most of the outliers in AP refer to unfragmented ligands as shown in Figure S1.C and S1.D, while outliers in FP mostly refer to large fragments that extend widely into the solvent.
This analysis reflects the general knowledge medicinal chemists have about kinase inhibitors: An HBD-HBA recognition motif is required for binding to the hinge region, the SE subpocket is used to attach functional groups that increase compound solubility, and the GA region accommodates small and hydrophobic moieties. This demonstrates the KinFragLib method's ability to automatically capture the chemical properties of kinase inhibitors.
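The duplicate share discussed in this subsection can be illustrated with a short sketch. It rests on one possible reading of the definition, namely that every fragment occurrence whose identifier appears more than once in the same subpocket pool counts as a duplicate; the function name and the use of SMILES-like strings as identifiers are assumptions for illustration.

```python
from collections import Counter, defaultdict


def duplicate_share(fragments):
    """Per subpocket, the share of fragment occurrences seen more than once.

    `fragments` is an iterable of (subpocket, fragment_id) tuples, one per
    fragment occurrence; `fragment_id` could be a canonical SMILES string.
    """
    pools = defaultdict(Counter)
    for subpocket, fragment_id in fragments:
        pools[subpocket][fragment_id] += 1
    shares = {}
    for subpocket, counts in pools.items():
        total = sum(counts.values())
        duplicated = sum(n for n in counts.values() if n > 1)
        shares[subpocket] = duplicated / total
    return shares


toy_fragments = [
    ("GA", "c1ccccc1"), ("GA", "c1ccccc1"), ("GA", "CC"),
    ("AP", "c1ccncc1"),
]
print(duplicate_share(toy_fragments))
```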

Fragment similarity per subpocket
In the following, the fragment similarity was analyzed within each subpocket to assess if certain subpockets are occupied by more similar ligands than others. Overall, the intra-subpocket fragment similarity does not differ much between the subpockets and is generally rather low (Figure 5.B, Table S1). The highest average intra-subpocket similarity was observed in AP with a mean of 0.14, the lowest in B1 (0.07), B2 (0.09), and FP (0.09). A higher similarity in AP can be explained by the lower flexibility of this kinase region and the targeted design of chemical moieties interacting specifically with the hinge region. The low average similarity within FP might be due to the larger space around the FP center compared to the other subpockets, allowing a higher diversity in FP fragments. The low similarity in B1 and B2 is probably the result of the small amount of data available for these subpockets.
In general, this analysis indicates that after removing duplicates in each subpocket pool, a high diversity of chemical structures is present in the fragment pools, which underlines the potential of KinFragLib to generate novel chemical matter.
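A mean intra-subpocket similarity can be computed as the average over all pairwise fragment comparisons within one deduplicated pool. The sketch below assumes the Tanimoto coefficient on fingerprints represented as sets of on-bit indices; the fingerprint type is not specified in this section, and deduplicating by identical fingerprint (rather than by fragment identity) is a simplification made here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Mean Tanimoto similarity over all pairs of distinct fingerprints.

    Returns None if fewer than two distinct fingerprints are available.
    """
    unique = {frozenset(fp) for fp in fingerprints}  # drop duplicates first
    pairs = list(combinations(unique, 2))
    if not pairs:
        return None
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


print(mean_pairwise_similarity([{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]))
```

Removing duplicates before averaging matters: otherwise frequently crystallized fragments would contribute many comparisons with similarity 1 and inflate the mean.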

Fragment promiscuity
Fragment promiscuity was addressed from two angles: (i) Are fragments more similar within kinase groups than across kinase groups? (ii) If fragments are observed multiple times in the same subpocket pool, are the respective ligands co-crystallized with different kinases (or kinases from the same group)?
(i) All fragments were grouped by subpockets (excluding pool X) and kinase groups.
Within each of these subsets, fragments were deduplicated and similarities for all pairwise fragment combinations were calculated and pooled by kinase groups. This results in fragment similarities per kinase group, while in each kinase group only fragments were compared that occupy the same subpocket. If fragments were indeed selective for specific kinase groups, a higher fragment similarity would be observed within kinase groups compared to across all kinase groups (i.e. pooling all similarities from all subpockets). However, no significant difference can be observed (Figure 5.C). This result indicates that the collected fragments are potentially useful for the design of an inhibitor of any target kinase.
(ii) All fragments were grouped by and deduplicated within subpockets (excluding pool X), while the number of duplicates was kept per deduplicated fragment: 67% represent singletons (appear only once per subpocket) and 12% originate from different molecules that were bound to the exact same kinase and subpocket. One interpretation of this result can be that 79% of the collected fragments have the potential to be part of a molecule that specifically inhibits one kinase. This is in line with the arguments by Xing et al. 50 and Hu and Bajorath 51 after exploring kinase hinge binding scaffolds. Another interpretation can be that 4 out of 5 fragments have never been explored on kinase targets from a different family.
Using this information to create kinase-focused chemical matter could therefore be extremely useful. The remaining 21% of the fragments were bound to more than one kinase. More than three quarters of this fragment set even co-crystallized with kinases from more than one kinase group. This result supports the conclusion that fragments can be promiscuous, i.e. identical fragments can interact with multiple different kinase targets. Instead, the combination of different fragments could be the key for kinase selectivity.
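The grouping behind analysis (ii) can be sketched as follows. The record layout, class labels, and function name are hypothetical; the sketch only illustrates how singletons, same-kinase duplicates, and fragments seen with several kinases or kinase groups could be separated.

```python
from collections import Counter, defaultdict


def promiscuity_classes(records):
    """Share of deduplicated fragments per promiscuity class.

    `records` is an iterable of (subpocket, fragment_id, kinase, kinase_group)
    tuples, one per fragment occurrence.
    """
    seen = defaultdict(list)
    for subpocket, fragment_id, kinase, group in records:
        seen[(subpocket, fragment_id)].append((kinase, group))
    classes = Counter()
    for occurrences in seen.values():
        kinases = {kinase for kinase, _ in occurrences}
        groups = {group for _, group in occurrences}
        if len(occurrences) == 1:
            classes["singleton"] += 1
        elif len(kinases) == 1:
            classes["same kinase"] += 1
        elif len(groups) == 1:
            classes["several kinases, one group"] += 1
        else:
            classes["several kinase groups"] += 1
    total = sum(classes.values())
    return {name: count / total for name, count in classes.items()}


toy_records = [
    ("AP", "f1", "EGFR", "TK"),
    ("AP", "f2", "EGFR", "TK"), ("AP", "f2", "EGFR", "TK"),
    ("AP", "f3", "EGFR", "TK"), ("AP", "f3", "ABL1", "TK"),
    ("AP", "f4", "EGFR", "TK"), ("AP", "f4", "CDK2", "CMGC"),
]
print(promiscuity_classes(toy_records))
```

In the terms of this section, the first two classes together correspond to the 79% of fragments observed with a single kinase, and the last two to the remaining 21%.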

Common fragments and motifs per subpocket
In order to illustrate the chemical nature of the fragments within each subpocket pool and demonstrate differences and similarities across them, representative fragments are shown in Figure 6.
The  Table 6 in the original publication).
In order to assess overlaps and differences in results from different approaches, hinge binding fragments from literature are compared to fragments from the hinge-equivalent subpocket in this study, i.e. the AP subpocket pool. Xing et al. 50 and Mukherjee et al. 21 both report their 10 most common hinge scaffolds/fragments (Figure 1 and Figure 7 in the original publications, respectively). Excluding adenine and staurosporine from the comparison, which were removed from this library (see Table 1
(Table S2). While both reported methods check for hydrogen bonding between the fragment and the hinge region in crystal structures, KinFragLib is able to retrieve hinge-contacting fragments without specifically searching for hinge contacts but by checking the position within the binding site.

Recombined molecules
To exemplify the power of the combinatorial library, molecules were enumerated based on a reduced and diverse subset of the fragment library consisting of 624 fragments (see Subsection "(5) Fragment recombination"
Figure S2-S7. Note that dummy atoms were replaced by hydrogen atoms.
erated coincidentally from different fragment combinations.

Recombined original ligands from KLIFS
An important way to control the relevance of the generated chemical matter is to demonstrate this workflow's ability to reconstruct the ligands from which the reduced set of 624 fragments originates (reduced original ligands): 35 recombined molecules have exact matches and 324 recombined molecules are substructures of these ligands. Note that only a subset of fragments (624 out of 2,977) was used for recombination, thus only a fraction of original ligands can be retrieved.

Recombined ChEMBL molecules
The search for exact matches in ChEMBL 33 (1,782,229 molecules) revealed that only 298 of the over 6.7 million recombined molecules have already been described in ChEMBL. Only 218 matching molecules remain after removing the exact and substructure matches in the "reduced original ligands" used for the fragmentation. Consulting bioactivity data available in ChEMBL, 47 out of these 218 molecules have been shown to be active against human target(s) (activity is here defined as IC50 ≤ 500 nM): 44 are active against kinases, two against cytochrome P450, and one against a voltage-gated ion channel. In total, 10 molecules show a high activity against kinases with an IC50 ≤ 5 nM (see Figure 7). More details on the ChEMBL IDs and molecular structures are shown in Table S3 and Figure S8. This provides strong evidence that the library contains molecules with a high chance of exhibiting kinase activity.
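The two activity cutoffs used above (IC50 ≤ 500 nM for active, IC50 ≤ 5 nM for highly active) can be applied as in the following sketch. How several measurements per molecule were aggregated is not stated in this section; taking the lowest reported IC50 is an assumption made here, and the function name and labels are hypothetical.

```python
def classify_activity(ic50_values_nm, active_cutoff=500.0, high_cutoff=5.0):
    """Label a molecule by its most potent reported IC50 value (in nM).

    Assumption: the lowest IC50 among all measurements decides the label.
    """
    if not ic50_values_nm:
        return "no data"
    best = min(ic50_values_nm)
    if best <= high_cutoff:
        return "highly active"
    if best <= active_cutoff:
        return "active"
    return "not active by this definition"


print(classify_activity([1200.0, 350.0]))
```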

Chemical novelty (with respect to KLIFS subset and ChEMBL)
Excluding the 359 original ligands (35 exact Figure S9).
At the same time, as discussed before, 35 original kinase inhibitors from KLIFS and 44 additional potent kinase inhibitors in ChEMBL could be recombined, while using only a subset of the fragment library. This indicates that the novel fragment library can generate large libraries of novel chemical matter, while being tailored for the design of kinase inhibitors.

Properties of recombined molecules
The majority of the 6.7 million recombined molecules include fragments able to occupy four subpockets (90%), whereas the majority of original ligands are smaller and occupy three (53%) or two (28%) subpockets only. This is a consequence of a choice made in order to illustrate the power of exhaustive in silico library enumeration (the linking of fragments reaching up to four subpockets was allowed in this case). Most importantly, the presented workflow allows for tailored library design that can easily be adapted to fulfill the requirements of a particular project.
For comparison, 86% of all kinase inhibitors in clinical trials (dataset from 2020-07-15 downloaded from PKIDB 23 ) fulfill Lipinski's rule of five. Of the combinatorial library, 63% (4.2 million molecules) still comply with Lipinski's rule of five (Figure 8), representing a large kinase-focused library to be used for virtual screening studies.
Note that only a subset of fragments was used to generate the recombined library, thus, even larger libraries could be generated by taking into account all fragments identified in this study.
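A rule-of-five filter of the kind referred to above can be sketched on precomputed descriptors. The four thresholds are the standard Lipinski criteria; tolerating one violation (`max_violations=1`) is a common convention that is assumed here, since the exact compliance criterion is not restated in this section. The function names are hypothetical, and descriptor calculation (e.g. with a cheminformatics toolkit) is left out.

```python
def lipinski_violations(mol_weight, logp, n_hbd, n_hba):
    """Count violated Lipinski criteria for one molecule."""
    return sum([
        mol_weight > 500,  # molecular weight above 500 Da
        logp > 5,          # octanol-water partition coefficient above 5
        n_hbd > 5,         # more than 5 hydrogen bond donors
        n_hba > 10,        # more than 10 hydrogen bond acceptors
    ])


def fulfills_ro5(mol_weight, logp, n_hbd, n_hba, max_violations=1):
    """True if at most `max_violations` criteria are violated (assumed default: 1)."""
    return lipinski_violations(mol_weight, logp, n_hbd, n_hba) <= max_violations


print(fulfills_ro5(mol_weight=350.4, logp=3.2, n_hbd=2, n_hba=5))
```

Setting `max_violations=0` gives the stricter reading in which all four criteria must hold.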

Conclusion
Kinases are one of the most studied protein families in medicinal chemistry, resulting in an amount of available data too large to be handled by a human brain. By combining a precise cartography of the ATP binding site and a tailored fragmentation method, KinFragLib makes it possible to read and fragment inhibitors co-crystallized with a kinase in the DFG-in conformation and to organize the resulting fragments by subpocket. The subsequent analysis of the chemical matter of the compiled fragments is in agreement with the general knowledge of medicinal chemists, identifying small and lipophilic fragments in the gatekeeper area, solubilizing fragments in the front pocket, and typical hinge binders for the adenine pocket. While this analysis is also in line with previous works conducted for the hinge binding fragments, this study provides for the first time a fragment library that is organized by subpocket, unveiling subpocket occupation and connection frequencies. It was found that chemically diverse fragments can bind the same subpocket. Furthermore, 79% of the identified fragments were observed with only one kinase, but the other 21% bound the same subpocket in more than one kinase, mostly in kinases from different kinase groups.
This result indicates that a fragment binding one kinase subpocket can also bind the same region of other kinases. Therefore, the high chemical diversity of the generated fragment library is a rich source of inspiration for building novel kinase inhibitors. To investigate this possibility, a library of recombined fragments was enumerated in silico (using a diverse subset of the fragments only). The resulting virtual library containing over 6.7 million molecules was compared to the ChEMBL database (exact matches), indicating 99.99% novel chemical matter. The rare exceptions of compounds with precedent in the literature include predominantly known kinase inhibitors. These results clearly highlight the enormous potential of this fragment library for the design of novel kinase inhibitors.
The reported method focused on two types of kinase inhibitors (type I and I1/2); however, other libraries could be generated by fragmenting other kinase inhibitor types. Similarly, the same protocol could be applied to a more specific set of ligands, e.g. to design a library of fragments specific to a kinase group, or to a different dataset of ligand-kinase 3D structures.
And finally, this workflow is also perfectly suited to support a fragment-growing approach after one novel fragment has been validated in a kinase subpocket.

Code and Data Availability
The generated fragments and recombined ligands, the full fragment and combinatorial library analysis, and a quick start notebook on how to access the data are freely available at