Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC.

05 December 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Discovering an efficient new molecule can have a huge impact on a chemical research field. For several problems, current knowledge is too scarce to train robust deep learning models, and an exploratory approach can be a solution. However, when several types of atoms are considered, an enormous number of combinations is possible even for small molecules, and many of these combinations contain very exotic associations. In addition to filtering on connectivity features (based on ECFP4), we introduce a new filter based on cyclic features. In this article, we show that whitelists including all connectivity and cyclic features of either ChEMBL alone or ChEMBL and ZINC together allow for the definition of large, realistic chemical spaces. An enumeration dataset, Evo10, has been built; it contains more than 600 000 molecules with 10 or fewer heavy atoms drawn from the set C, N, O, F and S. Starting only from a methane molecule, we were able to navigate through the chemical space of those realistic molecules and rediscover all molecules passing these same filters from the reference datasets, which are here ChEMBL, ZINC, QM9, PC9, GDB11 and GDBChEMBL. The distributions of SAscores, CLscores and RAscores for all the generated molecules confirm that the vast majority of them seem realistic. Above all, it is the visualisation of the top solutions proposed after filtering, for the optimisation of the QED or of the HOMO and LUMO energies, that convinces us of the relevance of this approach for the systematic de novo generation of realistic solutions.
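The whitelist idea described in the abstract can be illustrated with a minimal sketch: collect every feature observed in a reference dataset, then accept a candidate only if all of its features appear in that collection. The code below is not the authors' implementation. The names build_whitelist, passes_filter and toy_featurize are hypothetical, and toy_featurize is a deliberately crude stand-in (single atoms and adjacent atom pairs in a linear string) for the real ECFP4 connectivity features and cyclic features, which would normally be computed with a cheminformatics toolkit.

```python
from typing import Callable, Hashable, Iterable, Set


def toy_featurize(mol: str) -> Set[Hashable]:
    """Hypothetical stand-in for ECFP4/cyclic features.

    Treats a molecule as a linear string of one-letter atoms and returns
    the atoms plus every adjacent atom pair. This is an assumption made
    only to keep the example self-contained.
    """
    pairs = {mol[i:i + 2] for i in range(len(mol) - 1)}
    return set(mol) | pairs


def build_whitelist(
    reference: Iterable[str],
    featurize: Callable[[str], Set[Hashable]],
) -> Set[Hashable]:
    """Union of all features seen in the reference dataset."""
    whitelist: Set[Hashable] = set()
    for mol in reference:
        whitelist |= featurize(mol)
    return whitelist


def passes_filter(
    mol: str,
    whitelist: Set[Hashable],
    featurize: Callable[[str], Set[Hashable]],
) -> bool:
    """A candidate passes only if every one of its features is whitelisted."""
    return featurize(mol) <= whitelist


# Toy reference set playing the role of ChEMBL/ZINC (made-up strings).
reference = ["CCO", "CCN", "CNC"]
whitelist = build_whitelist(reference, toy_featurize)

novel_ok = passes_filter("CNCO", whitelist, toy_featurize)   # only known features
exotic = passes_filter("COO", whitelist, toy_featurize)      # "OO" never observed
unknown_atom = passes_filter("CF", whitelist, toy_featurize)  # "F" never observed
```

Note that "CNCO" is absent from the toy reference set yet passes, because each of its local features has been observed somewhere in the reference data. This is the property that lets a whitelist define a space larger than the datasets it was built from while still rejecting unseen associations.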

Keywords

chemical space
