MolGen-Transformer: A molecule language model for the generation and latent space exploration of pi-conjugated molecules

25 February 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The application of generative artificial intelligence (AI) to molecular discovery has unlocked vast potential for the automated design of new chemical systems. Molecular language models (LMs), however, face several challenges that limit their effectiveness: incomplete coverage of chemical space owing to limited training dataset diversity and size, limited chemical insight within latent space representations, and unreliable reconstruction. Here, we present MolGen-Transformer, a generative AI model designed to address these challenges through a latent-space-centered approach. Trained on a large and diverse dataset of approximately 198 million organic molecules, the model achieves 100% molecular reconstruction accuracy, ensuring stable and reliable latent space representations. MolGen-Transformer leverages a robust molecular encoding (here, the SELFIES representation) to guarantee valid outputs, enhance computational efficiency, and create a chemically meaningful latent space. To demonstrate the model’s capabilities, we develop and employ three sampling strategies: (1) production of diverse molecules through random latent space sampling, (2) generation of chemically similar molecules with tunable similarity and diversity, and (3) interpolation to identify chemical intermediates between target molecules, which provides insight into the continuity of the latent space. These methods enable flexible exploration of chemical space while addressing limitations of existing approaches. Combining accuracy and scalability, MolGen-Transformer provides a versatile platform for generating chemically relevant and structurally diverse molecular data. To promote further innovation and facilitate new opportunities for AI-driven molecular discovery, both the model and the sampling methods are publicly available.
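
The three sampling strategies named in the abstract can be illustrated on plain latent vectors. The following is a minimal sketch, not the authors' implementation: the function names, the standard normal prior, the Gaussian perturbation, and the straight-line interpolation are all assumptions made for illustration, and the step of decoding a latent vector back to a SELFIES string is omitted because it requires the trained model.

```python
import random


def sample_random(dim, rng):
    """Strategy 1 (assumed form): draw a latent vector from a standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_neighbors(z, sigma, n, rng):
    """Strategy 2 (assumed form): perturb a seed vector with Gaussian noise.

    A small sigma keeps samples close to the seed (more similar molecules);
    a larger sigma spreads them out (more diverse molecules).
    """
    return [[zi + rng.gauss(0.0, sigma) for zi in z] for _ in range(n)]


def interpolate(z_a, z_b, steps):
    """Strategy 3 (assumed form): evenly spaced points on the line z_a -> z_b.

    Both endpoints are included, so steps must be at least 2.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


if __name__ == "__main__":
    rng = random.Random(0)            # fixed seed for repeatable draws
    seed = sample_random(4, rng)      # hypothetical 4-dimensional latent space
    target = sample_random(4, rng)
    neighbors = sample_neighbors(seed, sigma=0.1, n=3, rng=rng)
    path = interpolate(seed, target, steps=5)
    print(len(neighbors), len(path))
```

In an actual workflow, each vector produced here would be passed to the model's decoder to obtain a molecule; the noise scale and the number of interpolation steps are the knobs that correspond to the "tunable similarity and diversity" and the "chemical intermediates" described above.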

Keywords

machine learning
generative AI

Supplementary materials

Title: Supporting Information
Description: This Supporting Information document provides supplementary analysis and additional results to support the findings presented in the main text. It includes detailed examinations of the dataset used in training the MolGen-Transformer, such as distribution and atom count analyses, as well as further examples of local molecular generation and molecular evolution.

