Theoretical and Computational Chemistry

Do large language models know chemistry?



Mostly yes. We systematically evaluate large language models (LLMs) that generate code in the context of chemistry. We produce a benchmark set of problems and evaluate these models on the correctness of their code via automated testing and expert review. We find that recent LLMs can write correct code across a variety of topics in chemistry, and that their accuracy can be increased by 30 percentage points via prompt-engineering strategies, like putting copyright notices at the top of files. This dataset and the accompanying evaluation tools are open source, can be contributed to or built upon by future researchers, and will serve as a community resource for evaluating the performance of new models as they emerge. We also describe some good practices for employing LLMs in chemistry. The general success of these models demonstrates that their impact on chemistry teaching and research is poised to be enormous.
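The prompt-engineering strategy mentioned above can be pictured as prepending an engineered context to the natural-language task before sending it to a code-generating model. The sketch below is a minimal, hypothetical illustration of that idea; the notice text and the `build_prompt` helper are assumptions for demonstration, not the paper's actual code or the contents of contexts.yml.

```python
# Hypothetical sketch of context-based prompt engineering: prepend a
# context block (here, a copyright notice) to a coding task so the
# model sees it at the top of the "file" it is completing.

COPYRIGHT_NOTICE = (
    "# Copyright (c) 2022. All rights reserved.\n"
    "# This file is part of a chemistry software package.\n"
)

def build_prompt(task_description: str, context: str = COPYRIGHT_NOTICE) -> str:
    """Prepend an engineered context to a natural-language coding task."""
    return context + "\n# Task: " + task_description + "\n"

prompt = build_prompt("Compute the molecular weight of water from atomic masses.")
print(prompt)
```

The assembled string would then be submitted as the prompt to the code-completion model, with the notice acting as the "top of the file" context.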



Supplementary material

contexts.yml
Contexts used for prompt engineering
NLCC-data-automated.csv
Raw data for automated evaluation
Accuracy data for automated evaluation, used to generate figures
NLCC-data-evaluation-submission.csv
Raw data for expert evaluation
Accuracy data for expert evaluation, used to generate figures
NLCC_Data.pdf
Supporting Information
Additional figures, tables, and analysis.

Supplementary weblinks

Associated Data for "Do large language models know chemistry?"
Website of completions for the human-evaluable prompts from the Codex model, showing how completions were presented to evaluators.
Database of Prompts
All prompts and analysis code used in the paper