Leveraging Large Language Models for Predictive Chemistry

Kevin Maik Jablonka; Philippe Schwaller; Andres Ortega-Guerrero; Berend Smit

doi:10.26434/chemrxiv-2023-fw8n4-v3

Theoretical and Computational Chemistry

17 October 2023, Version 3

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Machine learning has revolutionized many fields and has recently found applications in chemistry and materials science. The small datasets commonly found in chemistry sparked the development of sophisticated machine-learning approaches that incorporate chemical knowledge for each application and, therefore, require considerable expertise to develop. Here, we show that large language models trained on vast amounts of text extracted from the internet can easily be adapted to solve various tasks in chemistry and materials science by fine-tuning them to answer chemical questions posed in natural language with the correct answer. We compared this approach with dedicated machine-learning models for many applications, spanning properties of molecules and materials to the yield of chemical reactions. Surprisingly, this approach performs comparably to, or even outperforms, the conventional techniques, particularly in the low-data limit. In addition, we can perform inverse design successfully by simply inverting the questions. The high performance, especially for small datasets, combined with the ease of use, can fundamentally impact how we leverage machine learning in the chemical and materials sciences. Next to a literature search, querying a foundation model might become a routine way to bootstrap a project by leveraging the collective knowledge encoded in these foundation models or to provide a baseline for predictive tasks.
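To make the idea of question-and-answer fine-tuning data concrete, the following is a minimal sketch of how tabular chemistry records might be turned into natural-language prompt/completion pairs, and how the same records could be inverted for inverse design. The question templates, the `Example` class, the function names, and the two records are all hypothetical placeholders chosen for illustration; they are not taken from the preprint or its code repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One fine-tuning example: a question and its expected answer."""
    prompt: str
    completion: str


def forward_example(representation: str, prop: str, value: str) -> Example:
    # Forward task: ask for a property of a given molecule or material.
    # The question template is an assumption, not the authors' exact wording.
    return Example(
        prompt=f"What is the {prop} of {representation}?",
        completion=value,
    )


def inverse_example(representation: str, prop: str, value: str) -> Example:
    # Inverse design: swap question and answer, so the model is asked
    # for a structure that has the desired property value.
    return Example(
        prompt=f"What is a molecule with {prop} {value}?",
        completion=representation,
    )


# Made-up placeholder records: (SMILES string, property name, class label).
records = [
    ("CCO", "solubility class", "high"),
    ("c1ccccc1", "solubility class", "low"),
]

forward = [forward_example(*r) for r in records]
inverse = [inverse_example(*r) for r in records]

for example in forward + inverse:
    print(f"{example.prompt} -> {example.completion}")
```

Lists such as `forward` and `inverse` could then be serialized into whatever format a given fine-tuning service expects; that step is omitted here because it depends on the provider.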

Supplementary materials

Supplementary Information: Additional experiments and information.

Supplementary weblinks

GitHub repository: Contains code used for the experiments.

Version History

Oct 17, 2023 Version 3

May 16, 2023 Version 2

Feb 14, 2023 Version 1

Version Notes

Incorporated review comments

Metrics

Views: 50,861

Downloads: 15,120

License

The content is available under CC BY 4.0

Funding

MARVEL National Centre for Competence in Research, grant 51NF40-182892

NCCR Catalysis, grant 180544

The Grantham Foundation for the Protection of the Environment

Carl-Zeiss-Stiftung

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
