68 million natural product-like compound database generated via molecular language processing

12 January 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Natural products are a rich source of bioactive compounds with valuable applications across multiple fields such as food, agriculture, and medicine. For natural product discovery, high-throughput in silico screening offers a cost-effective alternative to traditional, resource-heavy, assay-guided exploration of structurally novel chemical space. In this data descriptor, we report a characterized database of 68,113,839 natural product-like molecules generated using a recurrent neural network trained on known natural products, a 167-fold expansion in library size over the currently estimated 406,919 known natural products. This study highlights the potential of deep generative models to uncover novel natural product chemical space for high-throughput in silico screening toward natural product discovery.
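The fold expansion quoted in the abstract can be checked directly from the two counts it reports. The following is a minimal sketch of that arithmetic only. The constant and function names are illustrative assumptions and are not taken from the preprint or its code.

```python
# Counts as quoted in the abstract; the names below are hypothetical.
GENERATED = 68_113_839  # natural product-like molecules in the reported database
KNOWN = 406_919         # currently estimated number of known natural products


def fold_expansion(generated: int, known: int) -> float:
    """Return how many times larger the generated library is than the known set."""
    if known <= 0:
        raise ValueError("known must be a positive count")
    return generated / known


ratio = fold_expansion(GENERATED, KNOWN)
print(f"{ratio:.1f}-fold expansion")  # prints "167.4-fold expansion"
```

Rounding the ratio to the nearest integer gives the 167-fold figure stated in the abstract.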


Natural product
machine learning
generative model
molecular design

