Abstract
In this work, we investigate the question: do code-generating large language models (LLMs) know chemistry? Our results indicate that, for the most part, they do. To evaluate this, we produce a benchmark set of problems and assess these models on the correctness of the code they generate, using both automated testing and evaluation by experts. We find that recent LLMs are able to write correct code across a variety of topics in chemistry, and that their accuracy can be increased by 30 percentage points via prompt engineering strategies, such as putting copyright notices at the top of files. The dataset and evaluation tools are open source; future researchers can contribute to or build upon them, and they will serve as a community resource for evaluating the performance of new models as they emerge. We also describe some good practices for employing LLMs in chemistry. The general success of these models demonstrates that their impact on chemistry teaching and research is poised to be enormous.
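As a rough illustration of the two ideas named in the abstract (prepending a context such as a copyright notice to a code prompt, and judging a completion by automated testing), the following is a minimal sketch. It is not the authors' evaluation code: the helper names `build_prompt`, `passes_test`, and `accuracy`, the placeholder notice text, the example task, and the hand-written "completion" are all assumptions made for illustration.

```python
from typing import Sequence

# Placeholder context; the actual contexts used in the paper are in its supplementary files.
COPYRIGHT_CONTEXT = "# Copyright (c) Example Authors. All rights reserved.\n"


def build_prompt(task_prompt: str, context: str = "") -> str:
    """Prepend an optional context (e.g. a copyright notice) to a code prompt."""
    return context + task_prompt


def passes_test(source: str, test_source: str) -> bool:
    """Execute the completed source and its test; True if nothing raises.

    Assumption: a real harness would sandbox untrusted model output
    instead of calling exec() in the current process.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of completions that passed their tests."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical benchmark item: a function signature and docstring to be completed.
task = (
    "def molar_mass_water():\n"
    '    """Return the molar mass of water in g/mol."""\n'
)
prompt = build_prompt(task, COPYRIGHT_CONTEXT)

# Stand-in for a model completion (written by hand here, not model output).
completion = "    return 2 * 1.008 + 15.999\n"
test = "assert abs(molar_mass_water() - 18.015) < 1e-3\n"

ok = passes_test(prompt + completion, test)
```

Under these assumptions, a gain "in percentage points" is the difference between two `accuracy` values, one computed with the context and one without, multiplied by 100.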
Supplementary materials
Contexts: Contexts used for prompt engineering.
Raw data for automated evaluation: Accuracy data from the automated evaluation, used to generate figures.
Raw data for expert evaluation: Accuracy data from the expert evaluation, used to generate figures.
Supporting Information: Additional figures, tables, and analysis.
Supplementary weblinks
Associated Data for Assessment of chemistry knowledge in large language models that generate code: Website containing the Codex model's completions for the human-evaluable prompts, showing how the completions were presented to evaluators.