
Exploring the GDB-13 Chemical Space Using Deep Generative Models

Submitted on 07.10.2018, 19:46, and posted on 09.10.2018, 12:29, by Josep Arús-Pous, Thomas Blaschke, Jean-Louis Reymond, Hongming Chen, Ola Engkvist
Recent applications of recurrent neural networks (RNNs) make it possible to train models that sample the chemical space. In this study we train RNNs on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1 % of the database) reproduces 68.9 % of the entire database when sampling 2 billion molecules. We also develop a method to assess the quality of the training process using log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” to compare the trained model to an idealized upper bound, which shows that complex molecules with many rings and heteroatoms are more difficult to sample. Finally, we suggest that the metrics obtained from this analysis can be used as a tool to benchmark any molecular generative model.
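The “coupon collector” comparison in the abstract can be made concrete with a small sketch. Assuming an idealized generator that draws molecules uniformly at random from GDB-13 (this is my illustration of the general bound, not the authors' exact mathematical model; the function name is hypothetical), the expected fraction of the N distinct molecules seen after n draws is 1 − (1 − 1/N)^n:

```python
import math

def expected_coverage(n_samples: int, n_distinct: int) -> float:
    """Expected fraction of distinct items seen after uniform random draws.

    Uses exp(n * log1p(-1/N)) as a numerically stable form of (1 - 1/N)^n.
    """
    return 1.0 - math.exp(n_samples * math.log1p(-1.0 / n_distinct))

if __name__ == "__main__":
    GDB13_SIZE = 975_000_000   # molecules in GDB-13 (from the abstract)
    N_SAMPLED = 2_000_000_000  # molecules sampled in the study
    # An ideal uniform sampler would cover roughly 87% of GDB-13 with
    # 2 billion draws, versus the 68.9% the trained RNN reproduces.
    print(f"ideal uniform coverage: {expected_coverage(N_SAMPLED, GDB13_SIZE):.1%}")
```

The gap between this idealized coverage and the model's observed 68.9 % is the kind of deviation the log-likelihood analysis attributes to hard-to-sample molecules with many rings and heteroatoms.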


Marie Skłodowska-Curie grant agreement no. 676434, “Big Data in Chemistry” (“BIGCHEM”)


University of Bern



Declaration of Conflict of Interest

No conflict of interest.