Large Language Models as Molecular Design Engines

Debjyoti Bhattacharya; Harrison Cassady; Michael Hickner; Wesley Reinhart

doi:10.26434/chemrxiv-2024-n0l8q-v2

Theoretical and Computational Chemistry


21 May 2024, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The design of small molecules is crucial for technological applications ranging from drug discovery to energy storage. Due to the vast design space available to modern synthetic chemistry, the community has increasingly sought to use data-driven and machine learning approaches to navigate this space. Although generative machine learning methods have recently shown potential for computational molecular design, their use is hindered by complex training procedures, and they often fail to generate valid and unique molecules. In this context, pre-trained Large Language Models (LLMs) have emerged as potential tools for molecular design, as they appear to be capable of creating and modifying molecules based on simple instructions provided through natural language prompts. In this work, we show that the Claude 3 Opus LLM can read, write, and modify molecules according to prompts, with an impressive 97% of the generated molecules being valid and unique. By quantifying these modifications in a low-dimensional latent space, we systematically evaluate the model’s behavior under different prompting conditions. Notably, the model is able to perform guided molecular generation when asked to manipulate the electronic structure of molecules using simple, natural-language prompts. Our findings highlight the potential of LLMs as powerful and versatile molecular design engines.
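To make the "valid and unique" figure concrete, the following is a minimal sketch of how such a fraction could be computed over a batch of generated SMILES strings. It is not the authors' evaluation code: the function names `looks_like_smiles` and `valid_unique_fraction` are hypothetical, the validity check is only a crude syntactic stand-in (a real evaluation would parse each string with a cheminformatics toolkit), and uniqueness is approximated by exact string match rather than by canonical SMILES. The definition used here (distinct valid outputs divided by total outputs) is likewise an assumption.

```python
from typing import Callable, Iterable


def looks_like_smiles(s: str) -> bool:
    """Hypothetical stand-in for a real SMILES parser.

    Checks only surface syntax: non-empty, no whitespace, balanced
    parentheses and square brackets, and paired ring-closure digits.
    It does NOT check valence or any other chemistry.
    """
    if not s or any(ch.isspace() for ch in s):
        return False
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # inside a [...] atom specification
    open_rings = set()   # ring-closure digits currently unpaired
    for ch in s:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # digits in [...] are isotopes, charges, or H counts
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first use opens, second closes
    return depth == 0 and not in_bracket and not open_rings


def valid_unique_fraction(
    generated: Iterable[str],
    is_valid: Callable[[str], bool] = looks_like_smiles,
) -> float:
    """Fraction of outputs that are valid and not a repeat of another output."""
    outputs = list(generated)
    if not outputs:
        return 0.0
    distinct_valid = {s for s in outputs if is_valid(s)}
    return len(distinct_valid) / len(outputs)


# Illustrative inputs only; these are not data from the paper.
samples = ["CCO", "c1ccccc1", "CCO", "C1CC", "CC(=O)O"]
fraction = valid_unique_fraction(samples)
```

Here `"C1CC"` is rejected because its ring-closure digit is never paired, and the second `"CCO"` is counted once, so three of the five outputs count as valid and unique. Passing a different `is_valid` callable lets the same counting logic be reused with a proper parser.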

Keywords

large language models

molecular design

computational chemistry

quantum chemistry

Parametric Method 7 (PM7)

SMILES

Supplementary materials

Title: Supplementary Information (SI) for “Large Language Models as Molecular Design Engines”

Description: Supplementary Information (SI) for the paper, containing additional recorded metrics and supporting figures that accompany the primary manuscript.

Supplementary weblinks

Title: Dataset for "Large Language Models as Molecular Design Engines"

Description: Comprises the data (claude-gpt-paper.zip), the code (claude-gpt-paper-codes.zip), and a molecular viewer app (llm-visulizer-dashapp.zip) for viewing the molecules generated by the Large Language Model.

Title: GitHub Repository for the codes used in the paper "Large Language Models as Molecular Design Engines"

Description: Contains a simple Jupyter notebook (GPT_modification_just_plots.ipynb) that can be run on Google Colab to generate the figures shown in the paper. Instructions for running the notebook are provided in its markdown cells and in the README file in the GitHub repository.


Version History

Aug 29, 2024 Version 3

May 21, 2024 Version 2

May 21, 2024 Version 1

Version Notes

The Supplementary Information (SI) has been updated to include the names and affiliations of the authors.

Metrics

Views: 3,233

Downloads: 1,698

Citations

License

The content is available under CC BY 4.0


Funding

Basic Energy Sciences

DE-AC05-00OR22725

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
