68 million natural product-like compound database generated via molecular language processing

Dillon Tay Wei Peng; Naythan Yeo Zhen Xi; Krishnan Adaikkappan; Yee Hwee Lim; Shi Jun Ang

doi:10.26434/chemrxiv-2023-wmgwn

Organic Chemistry


12 January 2023, Version 1

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Natural products are a rich resource of bioactive compounds for valuable applications across multiple fields such as food, agriculture, and medicine. For natural product discovery, high-throughput in silico screening offers a cost-effective alternative to traditional, resource-heavy, assay-guided exploration of structurally novel chemical space. In this data descriptor, we report a characterized database of 68,113,839 natural product-like molecules generated using a recurrent neural network trained on known natural products, demonstrating a significant 167-fold expansion in library size over the currently estimated 406,919 known natural products. This study highlights the potential of using deep generative models to uncover novel natural product chemical space for high-throughput in silico screening toward natural product discovery.
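As a quick sanity check on the expansion factor quoted in the abstract, the ratio of the two reported counts can be computed directly. This is only an illustrative sketch: the variable names are hypothetical, and the two numbers are the only values taken from the abstract.

```python
# Counts as reported in the abstract; variable names are illustrative only.
generated_molecules = 68_113_839   # natural product-like molecules in the database
known_natural_products = 406_919   # currently estimated known natural products

# Fold expansion = generated library size / known library size
fold_expansion = generated_molecules / known_natural_products

print(f"Fold expansion: {fold_expansion:.1f}")
```

The ratio rounds to the 167-fold figure stated in the abstract.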


Now Published

67 million natural product-like compound database generated via molecular language processing

Dillon W. P. Tay, Naythan Z. X. Yeo, Krishnan Adaikkappan, Yee Hwee Lim, Shi Jun Ang

Journal article

Scientific Data, Volume 10, Issue 1

Online publication date: May 19, 2023

Version History

Jan 12, 2023 Version 1

Metrics

Views: 1,949

Downloads: 689

License

The content is available under CC BY-NC-ND 4.0


Funding

Agency for Science, Technology and Research

#21719

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
