Abstract
Discovering an efficient new molecule can have a huge impact on a chemical research field. For several
problems, the current knowledge is too scarce to train robust deep learning models. An exploratory
approach can be a solution. However, when several types of atoms are considered, an enormous number
of combinations is possible, even for small molecules. Many of these combinations contain very exotic
associations. In addition to filtering on connectivity features (based on ECFP4), we introduce a new
filter based on cyclic features. In this article, we show that whitelists including all connectivity
and cyclic features of either ChEMBL or ChEMBL and ZINC allow for the definition of large realistic
chemical spaces. An enumeration dataset, Evo10, has been built with more than 600 000 molecules
of 10 or fewer heavy atoms drawn from the set C, N, O, F and S. Starting only from a methane molecule,
we were able to navigate through the chemical space of those realistic molecules and rediscover all
molecules passing these same filters from the reference datasets, here ChEMBL, ZINC,
QM9, PC9, GDB11 and GDBChEMBL. The distributions of SAscores, CLscores and RAscores for all
the generated molecules confirm that the vast majority of them appear realistic. Above all, the
visualisation of the top solutions proposed after filtering, for the optimisation of the QED or of the
HOMO and LUMO energies, convinces us of the relevance of this approach for the systematic de novo
generation of realistic solutions.
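
The whitelist filtering summarised above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each molecule is assumed to be already reduced to two sets of feature identifiers (connectivity features, such as ECFP4 environments, and cyclic features), and the names `build_whitelists` and `passes_filters` as well as the toy feature strings are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical encoding: a molecule is a pair of precomputed feature sets,
# (connectivity features, cyclic features). How the features are computed
# (e.g. from ECFP4 environments) is outside the scope of this sketch.
Features = Tuple[FrozenSet[str], FrozenSet[str]]


def build_whitelists(reference: Iterable[Features]) -> Tuple[Set[str], Set[str]]:
    """Collect every feature seen in the reference molecules."""
    connectivity: Set[str] = set()
    cyclic: Set[str] = set()
    for conn, cyc in reference:
        connectivity |= conn
        cyclic |= cyc
    return connectivity, cyclic


def passes_filters(mol: Features, whitelists: Tuple[Set[str], Set[str]]) -> bool:
    """Accept a molecule only if all of its features are whitelisted."""
    connectivity, cyclic = whitelists
    conn, cyc = mol
    return conn <= connectivity and cyc <= cyclic


# Toy reference set with placeholder feature names (not real fingerprints).
reference_set = [
    (frozenset({"c1", "c2"}), frozenset({"ring6"})),
    (frozenset({"c2", "c3"}), frozenset()),
]
whitelists = build_whitelists(reference_set)

known = (frozenset({"c1", "c3"}), frozenset({"ring6"}))
exotic = (frozenset({"c1", "c9"}), frozenset({"ring3"}))
```

In this sketch a candidate is rejected as soon as it carries a single feature absent from the reference data, which is one way to read the idea of restricting exploration to features already observed in ChEMBL or in ChEMBL and ZINC.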