Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study, we train RNNs on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1 % of the database) reproduces 68.9 % of the entire database when 2 billion molecules are sampled from it. We also develop a method to assess the quality of the training process using log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” to compare the trained model with an upper bound, and this comparison shows that complex molecules with many rings and heteroatoms are more difficult to sample. Finally, we suggest that the metrics obtained from this analysis can be used as a tool to benchmark any molecular generative model.
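The coupon-collector upper bound mentioned above can be illustrated with a short calculation. This is a minimal sketch, not necessarily the exact formulation used in the study: it assumes an idealized sampler that draws uniformly, with replacement, and only from the database, and it uses the rounded figures quoted in this abstract. The function name `expected_coverage` and the constants are hypothetical.

```python
import math


def expected_coverage(n_items: int, n_samples: int) -> float:
    """Expected fraction of n_items seen at least once after n_samples
    uniform draws with replacement (coupon-collector setting)."""
    # A given item is missed by every draw with probability (1 - 1/n)^k,
    # so the expected covered fraction is 1 - (1 - 1/n)^k.
    # log1p/expm1 keep the computation accurate for very large n.
    return -math.expm1(n_samples * math.log1p(-1.0 / n_items))


# Assumption: rounded values taken from the abstract, not exact counts.
GDB13_SIZE = 975_000_000      # molecules in GDB-13
SAMPLE_SIZE = 2_000_000_000   # molecules sampled from the trained model
OBSERVED_COVERAGE = 0.689     # fraction of GDB-13 reproduced by the model

ideal = expected_coverage(GDB13_SIZE, SAMPLE_SIZE)
print(f"ideal uniform sampler covers {ideal:.1%} of the database")
print(f"trained model reaches {OBSERVED_COVERAGE / ideal:.0%} of that bound")
```

Under these assumptions the ideal sampler covers roughly 87 % of the database, so even a perfect model would not reach 100 % with 2 billion samples. Comparing the observed 68.9 % against this bound, rather than against 100 %, is what makes the metric usable as a benchmark.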