Abstract
Large language models (LLMs) have shown promising potential across diverse chemistry tasks, including forward reaction prediction, retrosynthesis, and property prediction. However, their ability to capture the intrinsic chemistry of molecules remains unclear. To study this, we evaluate the consistency of state-of-the-art LLMs when the same molecules are presented in different representations, such as SMILES strings and IUPAC names. Our results reveal strikingly low consistency rates, below 1%, for commercial state-of-the-art LLMs. To address the imbalance between molecular representations in the training data, we fine-tune the models on data represented in both SMILES and IUPAC, but the models still produce inconsistent predictions. We therefore regularize training with a sequence-level, symmetric Kullback-Leibler (KL) divergence loss. Although the proposed KL divergence loss improves surface-level consistency, it does not lead to better accuracy; consistency and accuracy appear to be largely orthogonal, suggesting that these models do not understand chemistry in the way we expect. These findings point to inherent limitations of recent LLMs and to the need for more advanced approaches that encourage LLMs to capture intrinsic chemistry and thereby produce predictions that are both accurate and consistent.
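For illustration, the following is a minimal PyTorch sketch of how a sequence-level, symmetric KL divergence consistency term could be computed between the token distributions a model predicts for the SMILES-formatted and IUPAC-formatted versions of the same query. The function name, tensor shapes, masking scheme, and 0.5 weighting are illustrative assumptions, not the exact formulation used in the manuscript.

```python
import torch
import torch.nn.functional as F

def symmetric_kl_consistency_loss(logits_smiles, logits_iupac, pad_mask):
    """Sequence-level symmetric KL between token distributions predicted
    from SMILES-formatted and IUPAC-formatted versions of the same input.

    logits_smiles, logits_iupac: (batch, seq_len, vocab) decoder logits for the
        same target sequence under the two input representations.
    pad_mask: (batch, seq_len) bool tensor, True at non-padding target positions.
    """
    log_p = F.log_softmax(logits_smiles, dim=-1)
    log_q = F.log_softmax(logits_iupac, dim=-1)

    # Per-token KL(P || Q) and KL(Q || P); F.kl_div(input, target) computes
    # KL(target || input), with both arguments in log-space here.
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)

    per_token = 0.5 * (kl_pq + kl_qp)

    # Average over non-padding positions to obtain a sequence-level penalty.
    mask = pad_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In training, a term of this kind would typically be added to the standard cross-entropy loss with a tunable weight that balances consistency against task accuracy.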
Supplementary weblinks
Title
Consistency dataset
Description
This dataset was curated from the reaction prediction and property prediction subsets of the LlaSMol dataset. We augmented the original data by translating the SMILES string representation of each molecule into its IUPAC name. The resulting dataset consists of one-to-one mapped SMILES and IUPAC representations.
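As a loose illustration of the augmentation step described above, the sketch below pairs each record's SMILES string with a generated IUPAC name. The helper smiles_to_iupac is a hypothetical stand-in for whatever name-generation tool is actually used, and records whose translation fails are dropped to preserve the one-to-one mapping.

```python
def augment_with_iupac(records, smiles_to_iupac):
    """Attach an IUPAC name to each record that carries a SMILES string.

    records: iterable of dicts, each with a "smiles" key.
    smiles_to_iupac: callable returning an IUPAC name or None on failure
        (a hypothetical placeholder for the actual translation tool).
    """
    paired = []
    for rec in records:
        name = smiles_to_iupac(rec["smiles"])
        if name is not None:
            # Keep only molecules with both representations, so the mapping
            # between SMILES and IUPAC stays one-to-one.
            paired.append({**rec, "iupac": name})
    return paired
```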
Title
Code
Description
This code base contains all necessary files to reproduce the results in the manuscript.