ChemRxiv

Exploring the GDB-13 Chemical Space Using Deep Generative Models

preprint
Submitted on 07.10.2018 and posted on 09.10.2018 by Josep Arús-Pous, Thomas Blaschke, Jean-Louis Reymond, Hongming Chen, Ola Engkvist
Recent applications of recurrent neural networks (RNNs) make it possible to train models that sample the chemical space. In this study we train an RNN on molecular string representations (SMILES) of a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1 % of the database) reproduces 68.9 % of the entire database when 2 billion molecules are sampled. We also develop a method to assess the quality of the training process using log-likelihood plots. Furthermore, we use a mathematical model based on the "coupon collector problem" to compare the trained model to an upper bound, which shows that complex molecules with many rings and heteroatoms are more difficult to sample. We also suggest that the metrics obtained from this analysis can be used as a tool to benchmark any molecular generative model.
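
As an illustration of the "coupon collector" comparison described in the abstract, the short Python sketch below (not taken from the preprint; the exact formulation used by the authors may differ) estimates the coverage an idealised model sampling uniformly at random from GDB-13 would be expected to reach, which serves as an upper bound against which a trained model can be compared.

    import math

    # Coupon-collector-style upper bound (illustrative sketch, assumed formulation):
    # if an ideal model drew k samples uniformly with replacement from a database
    # of N unique molecules, the expected fraction of the database recovered is
    #   coverage = 1 - (1 - 1/N)**k  ~=  1 - exp(-k / N)

    N = 975_000_000        # approximate number of molecules in GDB-13
    k = 2_000_000_000      # number of sampled molecules

    ideal_coverage = 1.0 - math.exp(-k / N)
    print(f"Expected coverage under uniform sampling: {ideal_coverage:.1%}")
    # Prints roughly 87.1 %, to be contrasted with the 68.9 % reported for the trained RNN.
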

Funding

Marie Skłodowska-Curie grant agreement no. 676434, “Big Data in Chemistry” (“BIGCHEM,” http://bigchem.eu)

Email Address of Submitting Author

josep.arus@dcb.unibe.ch

Institution

University of Bern

Country

Switzerland

ORCID For Submitting Author

0000-0002-9860-2944

Declaration of Conflict of Interest

No conflict of interest.
