Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train an RNN on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when sampling 2 billion molecules after training. We also develop a method to assess the quality of the training process using log-likelihood plots. Furthermore, we use a mathematical model based on the "coupon collector problem" to compare the trained model against an upper bound, which shows that complex molecules with many rings and heteroatoms are harder to sample. Finally, we suggest that the metrics obtained from this analysis can serve as a tool to benchmark any molecular generative model.
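As a rough sketch of the kind of upper bound the coupon-collector analysis provides (the exact formulation in the paper may differ), consider an ideal generator that samples uniformly with replacement from all N molecules of GDB-13. After k draws, the expected fraction of the database seen is 1 − (1 − 1/N)^k ≈ 1 − exp(−k/N):

```python
import math

N = 975_000_000    # molecules in GDB-13
k = 2_000_000_000  # samples drawn, as in the study

# Ideal uniform sampling with replacement: expected coverage
# after k draws is 1 - (1 - 1/N)**k, well approximated by
# 1 - exp(-k/N) for large N.
upper_bound = 1 - math.exp(-k / N)
print(f"ideal coverage upper bound: {upper_bound:.3f}")  # ≈ 0.871
```

Under these assumptions an ideal sampler could cover about 87% of GDB-13 with 2 billion samples, which puts the reported 68.9% in context: the trained model concentrates probability mass on some molecules (e.g. simpler ones) at the expense of others.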