These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.

Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

Submitted on 26.09.2018 and posted on 27.09.2018 by Angela Lopez-del Rio, Alfons Nonell-Canals, David Vidal, Alexandre Perera-Lluna
Binding prediction between targets and drug-like compounds through Deep Neural Networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models remains an open issue. In this work, we explored how different cross-validation strategies, applied to data from different molecular databases, affect the performance of proteochemometrics models for binding prediction. These strategies are: (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on source database and (4) splitting based on both the clustering and the source database. These schemas are applied to a Deep Learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our Deep Learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compound clustering yields lower but more robust and credible performance estimates. Our results also show better performance when molecules are represented by their fingerprints.
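To make the difference between the four splitting strategies concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes that each compound already carries a cluster label (for example from K-means) and a source database label. The function names `random_folds` and `grouped_folds`, the greedy fold balancing, the toy labels, and the reading of strategy (4) as grouping by (cluster, database) pairs are all illustrative assumptions.

```python
import random
from collections import defaultdict


def random_folds(n_samples, n_folds, seed=0):
    """Strategy (1): shuffle sample indices and deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


def grouped_folds(groups, n_folds):
    """Strategies (2)-(4): keep every sample of a group in the same fold.

    `groups` holds one hashable label per sample, for example a cluster id,
    a source database name, or a (cluster, database) tuple.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption of this sketch): place the largest
    # groups first, each into the currently smallest fold.
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])
    return [sorted(fold) for fold in folds]


# Hypothetical toy data: 8 compounds with made-up cluster and database labels.
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
databases = ["A", "A", "A", "B", "B", "B", "A", "B"]

by_cluster = grouped_folds(clusters, n_folds=2)    # strategy (2)
by_database = grouped_folds(databases, n_folds=2)  # strategy (3)
by_both = grouped_folds(list(zip(clusters, databases)), n_folds=2)  # strategy (4)
```

With random folds, compounds from the same cluster or database can land on both sides of a train/test split, whereas the grouped variants hold out whole clusters or databases at once. This is the sense in which the grouped schemas are more restrictive.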


This research was partially supported by an Industrial Doctorate grant from the Generalitat of Catalonia to A.L.-d.R. (DI 2016-080). This work was also supported in part within the framework of the Ministerio de Economía, Industria y Competitividad (MINECO) with grants TEC2014–60337–R and TEC2017 DPI2017-89827-R, and the Centro de Investigación Biomédica en Red (CIBER) of Bioengineering, Biomaterials and Nanomedicine, an initiative of the Instituto de Salud “Carlos III” (ISCIII).




Universitat Politecnica de Catalunya and Mind the Byte SL





Declaration of Conflict of Interest

ALR, ANC and DV are affiliated with Mind the Byte SL, a company that develops and provides solutions for computational drug discovery using Big Data and Artificial Intelligence approaches.

Version Notes

First version