A Data-Driven Reaction Discovery Strategy Based on Large Language Models

03 January 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The discovery of novel reactions and optimization of reaction conditions are fundamental challenges in organic synthesis, with significant implications for retrosynthetic analysis and condition selection. This work proposes a data-driven strategy for reaction discovery, integrating high-throughput experimentation (HTE) with insights derived from large language models (LLMs). By leveraging LLMs to process chemical information from extensive literature, the method enables hypothesis-driven design and experimental validation, minimizing reliance on serendipity. Taking cross-electrophile coupling (XEC) as a case study, this research extracts key trends, substrate combinations, and reaction conditions from 520 relevant publications. The methodology identifies unexplored substrate pairs and designs reaction plates for HTE, facilitating systematic discovery. Additionally, the concept of directed evolution in chemical catalysis is explored, hypothesizing that catalytic conditions can evolve systematically based on structural and reactivity similarities. The findings demonstrate the utility of combining LLMs with HTE for reaction discovery and catalysis research. This approach emphasizes methodology development, prioritizing the generation of hypotheses and protocols over isolated reaction discoveries, offering a scalable framework for advancing chemical innovation.
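The step from literature-extracted substrate combinations to an HTE plate layout can be illustrated with a minimal sketch. This is not the authors' implementation: the electrophile classes, the set of "reported" pairs, the condition labels, and the function names `unexplored_pairs` and `assign_to_plate` are all hypothetical placeholders chosen for illustration, and the 96-well layout is an assumption.

```python
from itertools import combinations

# Hypothetical electrophile classes and "reported" pairs.
# Illustrative placeholders only, not data extracted in the paper.
ELECTROPHILE_CLASSES = ["aryl halide", "alkyl halide", "acyl chloride", "epoxide"]
REPORTED_PAIRS = {
    frozenset({"aryl halide", "alkyl halide"}),
    frozenset({"aryl halide", "acyl chloride"}),
}


def unexplored_pairs(classes, reported):
    """Return every class pair that does not appear in `reported`."""
    return [
        pair for pair in combinations(classes, 2)
        if frozenset(pair) not in reported
    ]


def assign_to_plate(pairs, conditions, rows="ABCDEFGH", n_cols=12):
    """Map each (pair, condition) combination to one well, filling row by row."""
    experiments = [(pair, cond) for pair in pairs for cond in conditions]
    capacity = len(rows) * n_cols
    if len(experiments) > capacity:
        raise ValueError(
            f"{len(experiments)} experiments exceed plate capacity of {capacity}"
        )
    plate = {}
    for index, experiment in enumerate(experiments):
        well = f"{rows[index // n_cols]}{index % n_cols + 1}"
        plate[well] = experiment
    return plate


if __name__ == "__main__":
    candidates = unexplored_pairs(ELECTROPHILE_CLASSES, REPORTED_PAIRS)
    # Condition labels are placeholders for catalyst/ligand/reductant sets.
    layout = assign_to_plate(candidates, ["condition 1", "condition 2", "condition 3"])
    for well, (pair, condition) in layout.items():
        print(well, pair, condition)
```

Storing each pair as a `frozenset` makes the comparison order-independent, so a combination counts as reported regardless of which electrophile the source publication lists first.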

Keywords

High-throughput experimentation (HTE)
Large language model (LLM)
Cross-electrophile coupling (XEC)

Supplementary materials

Title: Supporting Information
Description: Full prompt, additional visualization analysis of conditions and reactivity information extracted
