Abstract
The application of generative artificial intelligence (AI) to molecular discovery has unlocked vast potential for the automated design of new chemical systems. Molecular language models (LMs), however, face several challenges that limit their effectiveness: incomplete coverage of chemical space, caused by limited training dataset diversity and size; limited chemical insight within latent space representations; and unreliable molecular reconstruction. Here, we present MolGen-Transformer, a generative AI model designed to address these challenges through a latent-space-centered approach. Trained on a large and diverse dataset of approximately 198 million organic molecules, the model achieves 100% molecular reconstruction accuracy, ensuring stable and reliable latent space representations. MolGen-Transformer uses a robust molecular encoding (here, the SELFIES representation) to guarantee valid outputs, improve computational efficiency, and create a chemically meaningful latent space. To demonstrate the model’s capabilities, we develop and employ three sampling strategies: (1) production of diverse molecules through random latent space sampling, (2) generation of chemically similar molecules with tunable similarity and diversity, and (3) interpolation to identify chemical intermediates between target molecules, which provides insight into the continuity of the latent space. These methods enable flexible exploration of chemical space while addressing limitations of existing approaches. Combining accuracy and scalability, MolGen-Transformer provides a versatile platform for generating chemically relevant and structurally diverse molecular data. To promote further innovation and open new opportunities for AI-driven molecular discovery, both the model and the sampling methods are publicly available.
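The three sampling strategies named in the abstract can be illustrated on plain latent vectors. The following is a minimal sketch, not the MolGen-Transformer implementation: the latent dimension, the standard normal prior, the Gaussian perturbation, and the linear interpolation scheme are all assumptions made for illustration, and the function names are hypothetical. Decoding latent vectors back to molecules is omitted because it requires the trained model.

```python
import random

LATENT_DIM = 8  # hypothetical; the abstract does not state the real dimension


def sample_random(rng, dim=LATENT_DIM):
    """Strategy 1: draw a latent vector from a standard normal prior (assumed)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_neighbors(rng, z, scale, n):
    """Strategy 2: perturb a reference vector with Gaussian noise (assumed).

    A small `scale` keeps samples close to `z` (more similar);
    a larger `scale` spreads them out (more diverse).
    """
    return [[zi + rng.gauss(0.0, scale) for zi in z] for _ in range(n)]


def interpolate(z_start, z_end, steps):
    """Strategy 3: evenly spaced points on the straight line between two
    latent vectors, endpoints included (linear interpolation is assumed)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * k / (steps - 1) for a, b in zip(z_start, z_end)]
        for k in range(steps)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
z_a = sample_random(rng)
z_b = sample_random(rng)
neighbors = sample_neighbors(rng, z_a, scale=0.1, n=5)
path = interpolate(z_a, z_b, steps=5)
```

In a real pipeline, each vector in `neighbors` or `path` would be passed to the model's decoder to obtain a SELFIES string, which is what guarantees that every decoded output corresponds to a valid molecule.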
Supplementary materials
Title
Supporting Information
Description
This Supporting Information document provides supplementary analysis and additional results to support the findings presented in the main text. It includes detailed examinations of the dataset used in training the MolGen-Transformer, such as distribution and atom count analyses, as well as further examples of local molecular generation and molecular evolution.
Supplementary weblinks
Title
Transformer model and sampling methods
Description
The MolGen-Transformer model and three latent space sampling methods can be found at this site.