Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

07 May 2021, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


In small-molecule property prediction, experimental error is usually neglected during model generation. The main caveat of binary classification approaches is that they weight the minority of cases close to the threshold boundary equally with all other cases when distinguishing between activity classes. For example, with a classification threshold of 5, pXC50 activity values of 5.1 and 4.9 are treated as equally important contributions to opposing activity classes, even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice, so it is important both to evaluate the presence of experimental error in databases and to apply methodologies that account for variability in experiments and uncertainty near the decision boundary.

To improve upon this, we present a novel approach to predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF is a modification of the long-established Random Forest (RF) that takes into account uncertainties in the assigned classes (i.e., activity labels). This allows activity to be represented in a framework between the classification and regression architectures, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Compared to a classical regression framework, one can utilize all data (even delimited/operand/censored data far from a cut-off) while still taking into account the granularity around the cut-off. The algorithm was applied to ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. In comparison, the baseline RF outperformed PRF for cases that belong to the active class with high confidence (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold. To our knowledge, this is the first application of probabilistic modelling of activity data for target prediction using the PRF algorithm.
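The idea of replacing hard activity labels with uncertain ones can be illustrated with a small sketch. The abstract does not spell out the conversion, so the following is an assumption: it treats experimental error as normally distributed and uses the normal cumulative distribution function (CDF) to turn a measured pXC50 into a probability of belonging to the active class. The function name `prob_active` and the standard deviation of 0.3 log units are hypothetical choices for illustration only, not values taken from the manuscript.

```python
import math


def prob_active(pxc50, threshold=5.0, sd=0.3):
    """Probability that the true activity lies at or above the threshold.

    Assumes (for illustration) normally distributed experimental error
    with standard deviation `sd`; both defaults are hypothetical.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    # Normal CDF evaluated at the measured value, centred on the threshold.
    z = (pxc50 - threshold) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    for value in (4.0, 4.9, 5.0, 5.1, 6.0):
        hard_label = int(value >= 5.0)  # conventional binary label
        soft_label = prob_active(value)  # uncertainty-aware label
        print(f"pXC50={value:.1f}  hard={hard_label}  soft={soft_label:.3f}")
```

Under these assumptions, values of 4.9 and 5.1 receive soft labels on either side of 0.5 rather than opposite hard labels, while values far from the threshold stay close to 0 or 1. A classifier that accepts such probabilistic labels, as the PRF is described as doing, could then give less weight to measurements that the experimental error cannot reliably separate.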


Probabilistic Random Forest
Cumulative Distribution Function (CDF)
Uncertainty Estimation
Target prediction
QSAR Modeling
Experimental Error

Supplementary materials

Mervin Manuscript

