Theoretical and Computational Chemistry

Assessment of chemistry knowledge in large language models that generate code

Abstract

In this work, we investigate the question: do code-generating large language models know chemistry? Our results indicate that the answer is mostly yes. To evaluate this, we produce a benchmark set of problems and evaluate these models based on the correctness of their code, as judged by automated testing and by expert evaluation. We find that recent LLMs are able to write correct code across a variety of topics in chemistry and that their accuracy can be increased by 30 percentage points via prompt engineering strategies, such as putting copyright notices at the top of files. The dataset and evaluation tools are open source, so future researchers can contribute to or build upon them, and they will serve as a community resource for evaluating the performance of new models as they emerge. We also describe some good practices for employing LLMs in chemistry. The general success of these models demonstrates that their impact on chemistry teaching and research is poised to be enormous.

Version notes

Added more models for comparison, expanded the discussion of differences between models, and made various minor clarifications.

Content

NLCC_Data-4.pdf

Supplementary material

contexts.yml
Contexts
Contexts used for prompt engineering
NLCC-data-automated.csv
Raw data for automated evaluation
Accuracy data for automated evaluation used to generate figures
NLCC-data-evaluation-submission.csv
Raw data for expert evaluation
Accuracy data for expert evaluation used to generate figures
NLCC_Data-5.pdf
Supporting Information
Additional figures, tables, and analysis.

Supplementary weblinks

Associated Data for Assessment of chemistry knowledge in large language models that generate code
Hosts a website of completions for the human-evaluable prompts from the Codex model and shows how the completions were presented to evaluators.
Database of Prompts
All prompts and analysis code used for the paper