Abstract
Large language models (LLMs) have shown promising potential across diverse chemistry tasks, including forward reaction prediction, retrosynthesis, and property prediction. However, their ability to capture the intrinsic chemistry of molecules remains unclear. To study this, we evaluate the consistency of state-of-the-art LLMs when the same molecules are presented in different representations, such as SMILES strings and IUPAC names. Our results reveal strikingly low consistency rates, below 1%, for commercial state-of-the-art LLMs. To address the imbalance between molecular representations in the training data, we fine-tune the models on data represented in both SMILES and IUPAC, but the models still produce inconsistent predictions. We therefore regularize training with a sequence-level, symmetric Kullback-Leibler (KL) divergence loss. Although the proposed KL divergence loss improves surface-level consistency, it does not lead to better accuracy; consistency and accuracy appear to be largely orthogonal, suggesting that these models do not understand chemistry in the way we expect. These findings point to inherent limitations of recent LLMs and to the need for more advanced approaches that encourage LLMs to capture intrinsic chemistry and thereby produce predictions that are both accurate and consistent.
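For illustration, the following is a minimal PyTorch sketch of how a sequence-level, symmetric KL divergence consistency term could be computed between the token distributions a model predicts for the SMILES-formatted and IUPAC-formatted versions of the same query. The function name, tensor shapes, masking scheme, and 0.5 weighting are illustrative assumptions, not the exact formulation used in the manuscript.

```python
import torch
import torch.nn.functional as F

def symmetric_kl_consistency_loss(logits_smiles, logits_iupac, pad_mask):
    """Sequence-level symmetric KL between token distributions predicted
    from SMILES-formatted and IUPAC-formatted versions of the same input.

    logits_smiles, logits_iupac: (batch, seq_len, vocab) decoder logits for the
        same target sequence under the two input representations.
    pad_mask: (batch, seq_len) bool tensor, True at non-padding target positions.
    """
    log_p = F.log_softmax(logits_smiles, dim=-1)
    log_q = F.log_softmax(logits_iupac, dim=-1)

    # Per-token KL(P || Q) and KL(Q || P); F.kl_div(input, target) computes
    # KL(target || input), with both arguments in log-space here.
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)

    per_token = 0.5 * (kl_pq + kl_qp)

    # Average over non-padding positions to obtain a sequence-level penalty.
    mask = pad_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In training, a term of this kind would typically be added to the standard cross-entropy loss with a tunable weight that balances consistency against task accuracy.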
Supplementary weblinks
Title
Consistency dataset
Description
This dataset was curated from the reaction prediction and property prediction subsets of the LlaSMol dataset. We augmented the original data by translating the SMILES string representation of each molecule into its IUPAC name. The resulting dataset consists of one-to-one mapped SMILES and IUPAC representations.
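As a loose illustration of the augmentation step described above, the sketch below pairs each record's SMILES string with a generated IUPAC name. The helper smiles_to_iupac is a hypothetical stand-in for whatever name-generation tool is actually used, and records whose translation fails are dropped to preserve the one-to-one mapping.

```python
def augment_with_iupac(records, smiles_to_iupac):
    """Attach an IUPAC name to each record that carries a SMILES string.

    records: iterable of dicts, each with a "smiles" key.
    smiles_to_iupac: callable returning an IUPAC name or None on failure
        (a hypothetical placeholder for the actual translation tool).
    """
    paired = []
    for rec in records:
        name = smiles_to_iupac(rec["smiles"])
        if name is not None:
            # Keep only molecules with both representations, so the mapping
            # between SMILES and IUPAC stays one-to-one.
            paired.append({**rec, "iupac": name})
    return paired
```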
Title
Code
Description
This code base contains all necessary files to reproduce the results in the manuscript.