Balancing data on proteochemometrics activity classification

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity datasets used in proteochemometrics modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were: (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering and (4) semi_resampling. These schemas were evaluated in kinases and GPCRs from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometrics model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) in order to


Introduction
The discovery, design and bringing to market of a novel small-molecule drug is a very challenging process, and a very expensive one in terms of money, time and effort. 1 Computer-Assisted Drug Design (CADD) methods can help to improve and refine the identification of hits in the first steps of drug development, thus having a large positive impact on the costs of the whole process. 2 Traditionally, interactions between ligands and targets have been predicted in CADD through a Quantitative Structure-Activity Relationship (QSAR) approach. 3 In QSAR, a target is fixed and only information from compounds is used for modeling and predicting binding for said target. However, the compartmentalized nature of QSAR does not allow for discovering new cross-interactions between ligands and targets for which no training data is available. 2 Proteochemometrics modeling (PCM) is an extension of QSAR which overcomes this drawback by combining information from both ligand and protein descriptors in a supervised prediction model. PCM allows for the integration of different sources of data in one model and for the general prediction of which ligands will bind to which targets. 4 Both PCM and QSAR usually apply machine learning (ML) techniques such as random forests, support vector machines, logistic regression or partial least squares. 2,4 Following the trends in other fields and the growing availability of data, deep learning (DL) has also been increasingly and successfully applied to bioactivity prediction, 5 especially to QSAR modeling. 6 The application of DL to PCM followed, taking advantage of public databases [7][8][9] and improving the descriptor representation. 10,11 However, an important issue for PCM and QSAR DL models is the amount and quality of data when compared to other fields of application, since increasing the number of data samples in drug discovery is expensive and thus often infeasible. 12 This poses a problem, since neural networks require a large quantity of training data in order to actually learn.
While in other fields this problem is alleviated through data augmentation, i.e. an artificial increase of the number of observations of the training set to help the model generalize, this regularization technique is not yet commonly used in CADD. Some studies have considered different variants of the SMILES of each molecule as a way of data augmentation, 13,14 but despite its proven benefits, its use is not widespread yet. This is partly due to the lack of consensus in the input representations, where alternatives to SMILES are often used.
Another factor strongly affecting QSAR and PCM models is data imbalance, since the class definitions based on bioactivity data can result in highly skewed labels. In this regard, Zakharov et al 15 explored how data balancing affected self-consistent regression QSAR models using highly imbalanced PubChem bioassays. The study proposed a method including cost-sensitive learning and under-sampling approaches to obtain more accurate predictions. Using the same data, Korkmaz explored how data balancing affected DL-based QSAR models. 16 The study concluded that imbalance indeed has a negative impact on the performance of the models, but that this impact could be alleviated by applying oversampling methods like SMOTE (Synthetic Minority Oversampling Technique) 17 to the fingerprint representations of the molecules. Besides, oversampling methods could also serve the purpose of augmenting the original dataset.
While the effect of data imbalance on model performance has been studied for shallow ML and DL QSAR, to our knowledge there are no analogous studies yet for PCM.
In PCM, information is shared between targets, which may compensate for those targets whose activity data is very imbalanced. However, it remains to be shown whether this compensation actually happens or whether the results are dominated by the original imbalance of each target.
Recently it has been shown that for the validation of PCM models, it is important to control the chemical series bias through clustering techniques in order to get more reliable performance estimates. 8,18 This adds a complexity layer to the imbalance handling, since clustering can affect the data balance in PCM. Since Korkmaz and Zakharov et al did not consider the potential similarity between different compounds when validating their results, 15,16 its impact on data balancing is yet to be tested.
In this paper, we study the effect of different balancing strategies in DL-based PCM target-compound activity classification models. While handling data imbalance, we also study how to integrate the compounds clustering in this process. We describe the behavior of model predictions and performance according to imbalance handling.

Data
We evaluated the different balancing strategies on the benchmark dataset used in DeepAffinity. 19 The original dataset contains binding data from BindingDB, 20 merged with the amino acid sequence information from UniRef 21 and the SMILES representation of compounds from STITCH. 22 The original dataset consisted of IC50, Ki or Kd values from 829,033 compound-protein pairs. We classified the dataset proteins into the main protein families according to the release 2018_09 from Uniprot 23 and restricted our study to proteins of the kinase and G protein-coupled receptor (GPCR) families (separately). Binding activities were in logarithmic form, so a threshold of 6 was applied in order to obtain binary labels for classification (active/inactive). Table 1 summarizes the final dataset we used in our analysis. The same descriptive table for the GPCR family can be seen in Table S1 of the Supporting Information. In Figures S1-S4 of the Supporting Information, the proportion of actives/inactives for each protein of the kinase and GPCR families is represented in more detail.
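As an illustration of the labelling step, the following minimal sketch binarizes log-scale activities with the threshold of 6. It assumes that the values are negative-logarithm activities (e.g. pIC50-like), so that higher values mean stronger binding, and that a value exactly at the threshold counts as active; neither detail is stated in the text, and the function name and example values are hypothetical.

```python
def binarize_activity(log_activity, threshold=6.0):
    """Return 1 (active) or 0 (inactive) for a log-scale activity value.

    Assumption: higher values mean stronger binding and the threshold
    itself is counted as active.
    """
    return 1 if log_activity >= threshold else 0


# Hypothetical log-scale activities for five compound-protein pairs
activities = [4.2, 6.0, 7.5, 5.9, 8.1]
labels = [binarize_activity(a) for a in activities]
```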

Descriptors
We represented compounds by their molecular fingerprints, in which structural information is represented by bits in a bit string. We used the fingerprints from PubChem 24 provided in DeepAffinity. 19 In these, basic substructures of compounds are encoded in a 1D binary vector with a length of 881 bits.
We represented proteins by raw amino acid sequences transformed to one-hot encoding.
Each amino acid was represented by a binary vector of length 26. Protein sequences were then normalized to the maximum length of 1499. Those sequences shorter than 1499 were zero-padded. According to the recommendation of our previous work, 25 we tuned the padding type and obtained the best results with pre-padding (adding zeros to the beginning of the sequence).
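The protein encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 26-symbol alphabet is assumed here to be the upper-case letters A to Z, and the function name is hypothetical.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumption: one symbol per letter
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}
MAX_LEN = 1499


def one_hot_pre_padded(sequence, max_len=MAX_LEN):
    """One-hot encode a sequence and add all-zero rows at the beginning."""
    if len(sequence) > max_len:
        raise ValueError("sequence longer than the maximum length")
    padding = [[0] * len(ALPHABET) for _ in range(max_len - len(sequence))]
    encoded = []
    for symbol in sequence:
        row = [0] * len(ALPHABET)
        row[INDEX[symbol]] = 1
        encoded.append(row)
    # Pre-padding: zeros first, then the encoded residues
    return padding + encoded


matrix = one_hot_pre_padded("MKV")  # toy three-residue sequence
```

With pre-padding, the encoded residues always occupy the last rows of the matrix, whatever the sequence length.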

Validation strategy
A splitting strategy based on compound clustering (both of actives and inactives) was applied to the bioactivity data, omitting target information. Clustering-based validation strategies have been used to avoid the compound series bias, making sure that similar molecules do not appear in more than one of the training, validation and test sets. 18,26,27 We followed the implementation of our previous study on cross-validation strategies in PCM, 8 where K-means clustering with k = 100 was applied to the fingerprint description of the compounds. Data was divided into training, validation and test sets with a proportion of 80/10/10%. This splitting was randomly performed 10 times (folds) in order to test the consistency of the results, thus training and testing each model on 10 different data partitions. As further explained in the next subsection, for some balancing strategies the clustering was applied before the resampling and for others it was applied afterwards.
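A cluster-based split of this kind can be sketched as below. The sketch starts from cluster labels that are assumed to have been computed already (for example with K-means, k = 100) and assigns whole clusters to each set. Whether the 80/10/10 proportion was applied to clusters or to samples is not specified in the text; here it is applied to clusters, and the function name is hypothetical.

```python
import random


def split_by_cluster(cluster_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every compound to train/validation/test through its cluster."""
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n_train = int(round(fractions[0] * len(clusters)))
    n_val = int(round(fractions[1] * len(clusters)))
    assignment = {}
    for position, cluster in enumerate(clusters):
        if position < n_train:
            assignment[cluster] = "train"
        elif position < n_train + n_val:
            assignment[cluster] = "validation"
        else:
            assignment[cluster] = "test"
    # All members of a cluster inherit the set of that cluster
    return [assignment[c] for c in cluster_ids]


# Toy example: 1000 compounds spread evenly over 100 clusters
cluster_ids = [i % 100 for i in range(1000)]
split = split_by_cluster(cluster_ids, seed=0)
```

Changing `seed` yields a different random partition, which is one way to obtain the repeated folds.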

Balancing strategies
We chose an oversampling method to balance data since oversampling was shown to improve performance in the Korkmaz study of data imbalance in DL-based QSAR 16 and in a systematic study of data imbalance with CNNs. 28 Oversampling methods increase the number of samples in the minority class to create a balanced data set. Specifically, we used the SMOTE oversampling technique, 17 which creates synthetic data points of the minority class similar to those available. Resampling with SMOTE was done on a per-protein basis, so that each protein would be balanced. Some proteins had to be discarded in certain strategies, since there were either only active or only inactive ligands, or the number of samples in the minority class was smaller than the number of neighbors used for constructing the synthetic samples (k = 5) and SMOTE was not applicable.
Unlike Korkmaz, who applied data balancing methods to each training set, 16 we tested four different combinations of balancing, data clustering and splitting (see Figure 1): no_resampling, in which bioactivity data for each protein was taken as it was, and clustering was applied in order to perform the splitting; resampling_after_clustering, in which, after clustering the data and splitting it into training, validation and test sets, each protein's activity data in each set was resampled to attain a 50% actives/inactives proportion; resampling_before_clustering, in which, opposite to the previous strategy, resampling was applied prior to clustering and splitting, so while the global protein-wise proportion of actives/inactives was 50%, it did not have to be 50% within each split; and semi_resampling, in which the splitting performed in the no_resampling strategy was reused, the test set was kept without resampling, and the training+validation set was resampled, re-clustered and re-split into training and validation sets.
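To make the SMOTE step concrete, the sketch below implements the core interpolation idea in plain Python together with a per-protein applicability check. It is a simplified illustration and not the imbalanced-learn implementation used in the study. As an assumption of this sketch, a protein is considered resamplable only when the minority class has more than k samples, so that each sample has k distinct neighbours; all function names are hypothetical.

```python
import random


def smote_applicable(n_active, n_inactive, k=5):
    """True if the minority class is large enough for k neighbours."""
    return min(n_active, n_inactive) > k


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and a neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # k nearest minority neighbours of the chosen sample (itself excluded)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour
    return [x + gap * (y - x) for x, y in zip(base, neighbour)]


# Toy minority class: six 4-bit fingerprints
minority = [
    [1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 0, 1],
    [1, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 1],
]
synthetic = smote_sample(minority, k=5, rng=random.Random(0))
```

Because the synthetic point is an interpolation, its coordinates lie between 0 and 1 but are not necessarily binary, even when the inputs are bit vectors.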

Prediction Models
We built a DL model for studying the impact of different data balancing strategies in state-of-the-art PCM. A random prediction was generated to have an absolute, input-naïve baseline to compare our results with. (inactives), then concatenating both and shuffling. This procedure keeps the active/inactive balance by design while producing random activity predictions.
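Part of the baseline description is missing from the text, so the following is only one plausible reading of it: build as many positive predictions as there are actives and as many negative predictions as there are inactives, concatenate them and shuffle. The function name and the counts in the example are hypothetical.

```python
import random


def random_baseline(n_actives, n_inactives, seed=0):
    """Random predictions that keep the active/inactive balance by design."""
    predictions = [1] * n_actives + [0] * n_inactives
    random.Random(seed).shuffle(predictions)
    return predictions


# Toy protein with 30 actives and 70 inactives
baseline = random_baseline(30, 70)
```

The shuffled vector carries no information about the inputs, yet its proportion of actives matches that of the data it is compared against.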

Deep Learning Model
We studied the impact of data balancing strategies on a DL model. We followed the Korkmaz strategy of selecting a simple, well-established architecture whose complexity issues would not be a confounder of the factor under study. 16 We refrained from using Long Short-Term Memory networks since they have convergence issues when training sequences longer than 1000 elements. 29 Model hyperparameters were tuned using the validation set, choosing the simplest working architecture. As in our previous work, 8 the DL PCM model consisted of two analysis blocks. The amino acid sequence analysis block was a 1D convolutional neural network. The fingerprints analysis block consisted of a feed-forward neural network.
Dropout was used in both branches to prevent overfitting. 30 The representations built by the compound and target analysis blocks were then merged and the information was passed through a softmax activation unit, which quantified the ligand-target pair activity probability. A schematic representation of the DL-based PCM model can be found in Figure S5 of the Supporting Information, along with further details on the optimised hyperparameters.

Implementation
We trained every model with an Adam optimizer 31 (learning rate $= 5 \times 10^{-4}$, $\beta_1 = 0.1$, $\beta_2 = 0.001$, $\epsilon = 1 \times 10^{-8}$ and decay rate defined as the learning rate/number of epochs) for 100 epochs, with a batch size of 128 both for training and validation. Models were implemented in Python 3.6.9 (Keras 32 2.3.1 using Tensorflow 33 2.1.0 as backend) and run on two NVIDIA GeForce GTX 1070 GPUs. SMOTE data balancing was applied using the imbalanced-learn Python package. 34 The statistical processing of results was performed in R software (3.6.3). 35
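The decay rate stated above follows directly from the learning rate and the number of epochs. The sketch below computes it and, as an assumption, applies the time-based decay rule of the Keras 2.x optimizers, in which the learning rate is divided by (1 + decay × number of update steps). The function name is hypothetical.

```python
LEARNING_RATE = 5e-4
EPOCHS = 100
DECAY = LEARNING_RATE / EPOCHS  # decay rate as defined in the text


def decayed_learning_rate(iteration, lr=LEARNING_RATE, decay=DECAY):
    """Learning rate after a given number of update steps.

    Assumption: time-based decay, lr / (1 + decay * iteration).
    """
    return lr / (1.0 + decay * iteration)


initial = decayed_learning_rate(0)
halved = decayed_learning_rate(200_000)  # decay * iteration is about 1
```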

Characterization of data balance
The data balancing strategy had an impact on the actual data balance, defined as the proportion of active molecules for a protein:
$$\text{Data balance}(protein) = \text{Proportion of actives}(protein) = \frac{n_{active\_compounds}}{n_{total\_compounds}} \quad (1)$$
Thus, a comprehensive analysis of data balance was carried out to better understand and interpret performance results. For each of the balancing strategies, the original distribution of active ratios per protein was characterized. We also compared the original imbalance of the training and test sets for each strategy to explore possible trends, and studied the effect that other covariates (the protein length and the number of interactions of each protein in its corresponding set and fold) might have on the original test set imbalance.
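The per-protein data balance defined above can be computed as in the following minimal sketch, where each record is a (protein, label) pair with label 1 for active and 0 for inactive. The record layout, protein identifiers and function name are hypothetical.

```python
from collections import defaultdict


def actives_ratio_by_protein(pairs):
    """Proportion of active compounds for each protein."""
    counts = defaultdict(lambda: [0, 0])  # protein -> [n_active, n_total]
    for protein, label in pairs:
        counts[protein][0] += label
        counts[protein][1] += 1
    return {p: n_active / n_total for p, (n_active, n_total) in counts.items()}


pairs = [
    ("P1", 1), ("P1", 0), ("P1", 1), ("P1", 1),
    ("P2", 0), ("P2", 0),
]
ratios = actives_ratio_by_protein(pairs)
```

Running the same function separately on the training and test records of a fold gives the two ratios that are compared in the analysis.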
The next key question was to narrow down the factor driving the proportion of actives in the predicted data (as opposed to the original data). The main options under consideration were: (1) a constant, global imbalance that the model would learn from the whole dataset; (2) the protein-wise imbalance that the model would learn in the training set and (3) a test set-driven imbalance, based on its actual imbalance.
In the training process, the weights of the selected model were those from the epoch with the maximum accuracy (proportion of correct predictions) on the validation set. This process was run for each strategy and fold. Then, each selected model was used to predict on its corresponding test set. After the binarization of the test set predictions (probability threshold of 0.5), the proportion of predicted actives was computed by protein and also compared to the ratios of the original test and training sets.
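The selection and binarization steps can be sketched as follows. The handling of ties (the first epoch with the maximum accuracy is kept) and of probabilities exactly at 0.5 (counted as active) are assumptions of this sketch, and the function names and example values are hypothetical.

```python
def best_epoch(validation_accuracies):
    """Index of the epoch with the highest validation accuracy."""
    return max(range(len(validation_accuracies)),
               key=validation_accuracies.__getitem__)


def predicted_actives_ratio(probabilities, threshold=0.5):
    """Binarize predicted probabilities and return the fraction of actives."""
    labels = [1 if p >= threshold else 0 for p in probabilities]
    return sum(labels) / len(labels)


epoch = best_epoch([0.61, 0.70, 0.68, 0.70])
ratio = predicted_actives_ratio([0.1, 0.5, 0.9, 0.4])
```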

Performance Metrics
The resampling strategies were assessed with various performance metrics for binary classifiers and prioritisers. The selection was based on those used by Korkmaz: 16 balanced accuracy, F1 score, Matthews correlation coefficient (MCC) and area under the ROC curve (AUROC). All of them are insensitive to class imbalance. In the case of F1-score, we used the macro-average, which is computed by averaging the F1-score for the active and inactive labels. Further details on the definition of these metrics can be found in the Supporting Information.
The performance metrics were computed on the predictions of each selected model in its corresponding test set. AUROC was computed from raw predicted probabilities, while F1-score, balanced accuracy and MCC were derived from the binarized predictions. We tested the significance of the differences between strategies by means of a nonparametric two-sided Wilcoxon test for paired samples. 36
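For reference, the metrics derived from binarized predictions can be written in terms of the confusion matrix as in the sketch below, which uses their standard definitions. It assumes that both classes are present in the true labels; the zero-denominator conventions and function names are choices of this sketch.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    # Mean of the recall on actives and the recall on inactives
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def macro_f1(tp, tn, fp, fn):
    # F1 per label, then unweighted average over both labels
    f1_active = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    f1_inactive = 2 * tn / (2 * tn + fn + fp) if (2 * tn + fn + fp) else 0.0
    return 0.5 * (f1_active + f1_inactive)


def mcc(tp, tn, fp, fn):
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
```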

Explanatory Models
Performance metrics and predicted ratios were further described through linear models built upon the different combinations of variables considered in this analysis. Our prior work in similar scopes had found them insightful, since they allow for a statistical analysis of the contribution of each factor under study. 8,25,37 Each of the data points used for fitting an explanatory linear model corresponded to a different protein. Simpler claims were investigated with Pearson's r for linear correlation, using confidence intervals (CI) and p-values for significance.
On the one hand, the predicted ratio of actives ($r_{pred}$) was modelled through the quasibinomial logistic model 38 in Equation 2, stratified by strategy, in order to quantify the effect of different variables of interest:
$$r_{pred} \sim r_{training} + r_{test} + n_{int} + n_{seq} + k_{fold} \quad (2)$$
Specifically, the main variables of interest in this model were the actual ratios in the training ($r_{training}$) and in the test ($r_{test}$) sets, both numeric between 0 and 1. As additional covariates, the number of interactions ($n_{int}$) and the sequence length ($n_{seq}$) (both numerical) and the fold number ($k_{fold}$, categorical) were also included. This model was not computed for the resampling_after_clustering strategy, since the data balance (and thus, the predicted active ratio) is enforced.
On the other hand, each performance metric was explained through the linear model described by Equation 3:
$$metric \sim strategy + n_{int} + n_{seq} + k_{fold} \quad (3)$$
The response was the quantitative metric of interest in each case (one model per metric), while strategy was categorical (no_resampling, resampling_after_clustering, resampling_before_clustering, semi_resampling). The same covariates as in Equation 2 were added.
However, before evaluating the DL model, the performance metrics of the baseline were characterised: the strategy variable was tested with a type 3 analysis of variance (ANOVA) 39 in order to pinpoint the imbalance-sensitive and insensitive metrics. Metrics were called imbalance-sensitive if the imbalance-aware random baseline exhibited different performances between resampling strategies.
The imbalance-insensitive metric models were fitted analogously to the baseline performance models (with Equation 3). However, to address the pitfalls of the direct comparison of metrics whose baselines might differ, imbalance-sensitive performance metrics were adjusted by the baseline as follows:
$$adj\_metric = metric_{DL} - metric_{baseline} \quad (4)$$
Thus, adjusted performance metrics were also described with Equation 3, but changing the response to adj_metric of Equation 4. Note that while all the metrics but MCC were non-negative, the adjusted metrics could show negative values when the performance of the DL model was lower than that of the baseline.
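The baseline adjustment can be illustrated with a short sketch. It assumes that the adjusted metric is the difference between the DL metric and the baseline metric for the same protein, which is consistent with the remark that adjusted values become negative when the DL model performs below the baseline. The function name and the per-protein values are hypothetical.

```python
def adjusted_metric(metric_dl, metric_baseline):
    """Baseline-adjusted metric (assumed here to be a simple difference)."""
    return metric_dl - metric_baseline


# Hypothetical per-protein F1-scores for the DL model and the random baseline
dl_f1 = {"P1": 0.70, "P2": 0.40}
baseline_f1 = {"P1": 0.50, "P2": 0.60}
adjusted = {p: adjusted_metric(dl_f1[p], baseline_f1[p]) for p in dl_f1}
```

In this toy example the second protein ends up with a negative adjusted F1-score, because its random baseline already scores higher than the model.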
Reference categories for categorical variables were no_resampling for strategy and 0 for fold. Each term of the fitted model represents the difference between its specified category and the reference category of that variable.

Characterization of the original data balance
Distribution of the actives ratio
Resampling_after_clustering always kept balanced proteins, by design.

Other covariates
The effect that the number of interactions for each protein in its corresponding set and fold, and the protein length (i.e. number of amino acids) had on the test set imbalance was investigated (Figures S7-S8 and Tables S4-S5 of the Supporting Information). Proteins with the greatest imbalance tended to be among those with the fewest interactions.

Analysis of the predicted proportions
Figure 3 represents the ratio of predicted actives by protein. In the no_resampling model (Table S9), both the original training ($\beta = 8.312$, $p < 10^{-16}$) and test ratios ($\beta = 1.102$, $p = 2.6 \cdot 10^{-9}$) had positive, significant effects on the predicted actives ratio. In the three models, the number of interactions per protein had a significant, negative effect ($\beta = -0.391$, $-0.396$ and $-1.24$, all $p < 10^{-16}$), and some of the folds entailed significant variations of the predicted ratio.

Performance metrics
Baseline performance
Figure 5 shows a fold-averaged picture of the metrics by protein and by model type (DL or input-naïve baseline). Visual inspection suggested that the F1-score, accuracy, and possibly balanced accuracy were affected by the baseline data imbalance. To quantify this finding, the linear performance model (Equation 3) was fitted to the baseline performance metrics. According to Table S10 of the Supporting Information, the strategy term was significant (type 3 ANOVA, $p < 10^{-16}$; see also Table S12 of the Supporting Information). Figure 5 brought the dilemma of direct strategy comparison with imbalance-sensitive metrics, which was especially apparent for the F1-score and its high baseline in no_resampling (quartiles: Q1 = 0.428, median of 0.611, Q3 = 0.756, Table S10 of the Supporting Information).
Absolute, baseline-naïve performance
Absolute metric models (not accounting for baselines) were fitted following the linear performance model (Equation 3), analogously to the baseline performance models.
The strategy term always explained variance (type 3 ANOVA, p-values ranged between $2.89 \cdot 10^{-15}$ and $p < 10^{-16}$, see Table S13 in the Supporting Information). The models showed different behaviour in imbalance-sensitive and insensitive metrics (Table S14 of the Supporting Information).

Baseline-adjusted performance
A descriptive plot of the adjusted metrics (Figure 6) pointed to a different scenario than that of the unadjusted ones (Figure 5).
Again, the strategy term was always significant (type 3 ANOVA, p-values ranged between $2.78 \cdot 10^{-9}$ and $p < 10^{-16}$, Table S16 of the Supporting Information). Baseline adjustment brought a unified behaviour across the models (Table S17 of the Supporting Information).

GPCRs
We repeated all the previous analyses on the GPCR family to check whether the claims obtained for the kinase family could be generalized to other families. While their active proportion distributions were not too different, GPCR proteins were more imbalanced towards the actives than kinases (Figure S3 of the Supporting Information).
The main results obtained in kinases also apply to GPCRs (Table S19).

The impact of clustering in final imbalance was strategy-dependent
This study is focused on the characterization of the data imbalance present in bioactivity datasets, as well as how to address it. Bioactivity data also poses the problem of chemical series, i.e. sets of similar molecules with similar activities, that result in inflated performance metrics when split between training and test sets. We addressed those via a clustering prior to the splitting, ensuring that similar molecules would belong to the same set.
The first observation was that clustering modified data imbalance in a strategy-dependent way. When the starting set was perfectly balanced (strategy resampling_before_clustering), clustering and splitting induced a degree of imbalance, particularly visible in the heavier tails of the active ratios distributions in the test set. Compared to training, the lower sample sizes in the test set may also cause extreme imbalances more often. On the other end, this effect was only moderate in no_resampling, where the distribution of actives ratio was similar in train and test, but that of test had more extreme proteins with either all actives or all inactives.
Besides the overall changes in data imbalance, strategies differed in how the imbalance of a certain protein in the training set would translate to the test set. The positive trend in no_resampling suggests that existing data imbalances tended to persist after the clustering and splitting. The negative trend in resampling_before_clustering hints that, in the absence of imbalance, clustering will induce it. The flat trend in semi_resampling supports that the imbalance induced with the clustering in the training set, which was balanced with SMOTE beforehand, is independent from the original imbalance in the dataset (present in the test set).

The predicted actives proportion was driven by the test set rather than the training
The original distribution of actives ratio in each of the balancing strategies affected the actives ratio predicted by the models. Due to the lack of correlation between training and test ratios (Figure 2), the semi_resampling strategy was the ideal scenario to disentangle their effect on the predicted ratio of actives (see model in Table S3 of the Supporting Information).
Its additive model suggested that the original ratio of actives in test explained the predicted proportions, rather than the training ratio. We also found that the number of interactions per protein was a relevant factor: the more interactions, the lower the predicted active proportion, suggesting that the extreme cases with all compounds predicted as actives tended to be proteins with few interactions.
Likewise, resampling_before_clustering showed negative correlation between training and test ratios, also providing a reasonably good scenario to distinguish their effects (Table S3 of the Supporting Information). Its explanatory model confirmed both conclusions from the model in the semi_resampling strategy, with similar estimates (Table S8).
The explanatory model for the no_resampling strategy (Table S9 of the Supporting Information) suffered from the positive correlation between training and test ratios, whose effects could therefore be confounded. Both original training and test ratios showed a positive effect on the predicted fraction of actives. Although the estimate was larger and more significant for the training ratio coefficient, the confounding effect and the very skewed distribution of the predicted ratios rendered this model inconclusive.

Imbalance-sensitive metrics required baseline adjustment
The prediction task studied here posed a particular challenge: data imbalance happened on a protein basis, and the imbalance of certain proteins could be extreme (very low or high), moving away from the global actives ratio. Each resampling strategy would lead to different protein-wise imbalance patterns. The baseline performance of some metrics (accuracy, F1 score and balanced accuracy) was different between strategies, while it was constant for others (AUROC and MCC). The data-driven division into imbalance-sensitive and insensitive metrics was an important step to understand the opposite conclusions reached within each metric type after direct performance comparison between strategies (Figure 7).
The direct comparison of resampling strategies with imbalance-sensitive metrics would be confounded by the imbalance-induced bias in the metrics and the protein-wise imbalance differences between strategies. We found that adjusting by the baseline metrics (see Equation 4) brought an agreement in the conclusions obtained by both imbalance-sensitive and insensitive metrics. In turn, the same conclusions were obtainable by direct comparison of imbalance-insensitive metrics. Because of this, our recommendation is to include imbalance-aware baselines and to adjust imbalance-sensitive metrics when used for model selection.

Augmenting the test set was the largest performance driver
Our results showed that the largest impact in performance estimates was the application of data augmentation to the test set: resampling_before_clustering and resampling_after_clustering tended to outperform semi_resampling and no_resampling. However, augmenting the test set might not faithfully reflect new data anymore, and could artificially inflate the performance estimates: models may specialize in discriminating between original and resampled data points instead of actives and inactives.

Resampling improved performance when keeping the original test set
On the other hand, semi_resampling outperformed no_resampling in four out of five metrics (Tukey's method, p < 0.05, Figure S13 of the Supporting Information), which supported data augmentation usefulness even if the data balance in the test set differed from that of the training set. This was consistent with the observation that the main influence on the predicted actives ratio in the test set were their actual ratios in the test set instead of the original ratios in the training set. Combined with the less skewed distributions of predicted active ratios of semi_resampling against no_resampling (Figure 3), we recommend semi_resampling for future studies.

Using GPCRs as an external protein family dataset for validation suggests replicability of the main guidelines
The results obtained for the kinases and the GPCR proteins, used as an external validation set for the model fitting and evaluation, point to the same general picture with aligned conclusions. The differences found (the effects of the sequence length on protein imbalance and of the number of interactions on the predicted actives proportion differed in GPCRs) could be due to the fact that GPCRs are more imbalanced towards the actives. However, these results lead us to think that the guidelines for proteochemometrics models of this study provide sensible defaults for more protein families.

Similarities with existing literature
In this paper we have confirmed that data balance has an impact on DL proteochemometric target-compound activity models. Zakharov et al and Korkmaz arrived at a similar conclusion in a QSAR setting, 15,16 the latter also using DNN models for classification. More specifically, Korkmaz stated that the higher the imbalance for a protein, the worse the model performance (measured by F1-score and MCC).
These studies got the best performances by controlling data balance by means of un-
More importantly, the Zakharov and Korkmaz studies did not take into account the control of the compound series bias. This step is necessary for obtaining realistic performance estimates in a real-world setting. 8,18 Not only did we account for it, but we also investigated whether the stage at which the compound series control was introduced, in combination with the data augmentation (before or after applying SMOTE), had an impact on the outcome.
Indeed, the order had an impact in the model performance and needed careful consideration. Resampling_before_clustering solved the global imbalance of the dataset, but clustering after oversampling would lead again to a protein-wise imbalance. Analogously, semi_resampling resampled the training and validation sets, but imbalance returned after their clustering. On the contrary, resampling_after_clustering first corrected the problem of similar compounds, and then augmented the data to reach a protein-wise balance.

Limitations and future work
This study continues our incremental work on recommendations for DL models regarding input encoding 25 and control of chemical series. 8 While this study was limited to one architecture and two protein families, it provides a foundation to understand the basic behaviour of PCM models, insights on how to adjust performance metrics for a protein-wise analysis, and a first step towards exploring more general questions. Those could include architecture-centric analyses to confirm whether the same trends are observed when changing the layers or the model structure, or using other protein families with a different distribution of actives ratios, which may be flat or skewed towards the inactives.

Conclusion
Although the effect of data balance and resampling techniques had been analysed for QSAR models, it had not been studied yet in the context of proteochemometrics models, even if the bioactivity datasets used in this setting are usually imbalanced. In this paper, we have tested four different combinations of data oversampling (through SMOTE) and clustering for controlling compounds similarity. While the clustering avoids overly optimistic performance estimates, it could introduce more data imbalance (in the form of splittings having proteins with mostly active or inactive compounds). Despite this potential conflict between the resampling and the clustering, we found that resampling was useful to improve the model behaviour and performance.
Some common performance metrics were affected by the data imbalance and yielded misleading trends. We included an imbalance-aware random baseline and defined baseline-adjusted metrics to overcome this issue, especially in F1-score and accuracy. After baseline adjustment, the metrics provided a unified picture: the largest impact in performance estimates came from the application of data augmentation to the test set (resampling_before_clustering and resampling_after_clustering outperformed semi_resampling and no_resampling). However, augmenting the test set may not reflect a realistic scenario.
On the other hand, semi_resampling outperformed no_resampling in four out of five adjusted metrics and provided a more equalized distribution of predicted actives ratio. This confirmed the data augmentation usefulness even if the data balance in the test set differed from that of the training set. This was consistent with the finding that the predicted proportion of positives of the proteochemometrics model was explained by the actual data balance in the test set, rather than that of the training set. We also found that proteins with more interactions were better predicted.
Our recommendation is thus to use the semi_resampling strategy, i.e. clustering compounds to separate training and validation from test sets, resampling training and validation, and then clustering compounds again to make the final split into training and validation sets. This was carried out on the kinase protein family and further confirmed on the GPCR family.
While we cannot extrapolate these results to all the proteins and imbalance distributions, this sets a sensible starting point for improving proteochemometrics modelling and remains consistent with the corresponding data imbalance studies on QSAR models.

Data and code availability
The bioactivity data used in our analysis is publicly available in the repository https://github.com/Shen-Lab/DeepAffinity. 19 The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark. B2SLab is certified as 2017 SGR 952.

Supporting Information Available
The following files are available free of charge.
Figure 1: Description of the four balancing strategies that were applied to the bioactivity data. Resampling_before_clustering, where resampling per protein is applied prior to clustering and splitting; resampling_after_clustering, where data is first clustered and split and then each protein activity data in each set is resampled; semi_resampling, in which the splitting is performed and then the test set is kept without resampling but the training+validation set is resampled and clustered; and no_resampling, in which the imbalance of the original data is kept and clustering is applied prior to splitting.
Figure 2: Comparison of the training and test original active ratios, by resampling strategy (axes: actives ratio in the training set and actives ratio in the test set). Linear fit trends were added by strategy, and the shadowed areas indicate the 95% CI of the expected value. Each plot combines all the folds.