Integrating Machine Learning and Large Language Models to Advance Exploration of Electrochemical Reactions

28 August 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Electrochemical C-H oxidation reactions offer a sustainable route to functionalize hydrocarbons, yet identifying competent substrates and optimizing their synthesis remain challenging. Here, we report an integrated approach that combines machine learning (ML) and large language models (LLMs) to streamline the exploration of electrochemical C-H oxidation reactions. Using a batch rapid-screening electrochemical platform, we evaluated a wide range of reactions, initially classifying substrates by their reactivity, while LLMs text-mined literature data to augment the training set. The resulting ML models, one for reactivity prediction and one for site selectivity, both achieved high accuracy (>90%) and enabled virtual screening of a large set of commercially available molecules. To optimize reaction conditions for substrates of interest identified in the screening, LLMs were prompted to generate code that iteratively improves yield, lowering the barrier for scientists to access ML programs; this strategy efficiently identified high-yield conditions for eight drug-like substances or intermediates. Notably, we benchmarked the accuracy and reliability of 10 different LLMs, including Llama, Claude, and GPT-4, in generating and executing ML-related code from natural language prompts given by chemists, showcasing their tool-making and tool-using capabilities and their potential for accelerating research across four diverse tasks. In addition, we collected an experimental benchmark dataset comprising 1071 reaction conditions and yields for electrochemical C-H oxidation reactions, and our findings revealed that integrating LLMs and ML outperformed using either method alone. We envision that this combined approach offers a robust and generalizable pathway for advancing synthetic chemistry research.
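
The abstract mentions code that iteratively improves yield but does not say which optimizer was used. As a rough illustration only, the loop below sketches one possible shape of such a program: test a few conditions, fit a simple surrogate, and test the most promising untested condition next. Everything in it is an assumption made for illustration, including the condition grid (current and time), the nearest-neighbour surrogate, and the `simulated_yield` function, which stands in for a real experiment. It is not the authors' method.

```python
import itertools
import random

# Hypothetical condition grid; the variables and levels are illustrative only.
CURRENTS_MA = [2, 4, 6, 8, 10]
TIMES_H = [1, 2, 3, 4]
GRID = list(itertools.product(CURRENTS_MA, TIMES_H))


def simulated_yield(condition):
    """Stand-in for a real experiment: a made-up response peaking at (6 mA, 3 h)."""
    current, time_h = condition
    return max(0.0, 90.0 - 3.0 * (current - 6) ** 2 - 10.0 * (time_h - 3) ** 2)


def predict(condition, observed, k=3):
    """Surrogate model: mean yield of the k nearest conditions tested so far."""
    def distance(other):
        # Current is divided by its grid step (2 mA) so both axes are comparable.
        return ((condition[0] - other[0]) / 2) ** 2 + (condition[1] - other[1]) ** 2

    nearest = sorted(observed, key=distance)[:k]
    return sum(observed[c] for c in nearest) / len(nearest)


def optimize(run_experiment, n_initial=4, n_rounds=8, seed=0):
    """Test a few random conditions, then repeatedly test the untested
    condition with the highest predicted yield."""
    rng = random.Random(seed)
    observed = {c: run_experiment(c) for c in rng.sample(GRID, n_initial)}
    for _ in range(n_rounds):
        untested = [c for c in GRID if c not in observed]
        if not untested:
            break
        candidate = max(untested, key=lambda c: predict(c, observed))
        observed[candidate] = run_experiment(candidate)
    best = max(observed, key=observed.get)
    return best, observed[best], observed


best_condition, best_yield, history = optimize(simulated_yield)
print(best_condition, best_yield, len(history))
```

A purely greedy selection rule like this one can settle on a local optimum; a practical optimizer would typically add an exploration term, for example an uncertainty bonus as in Bayesian optimization.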

Keywords

Electrochemical C-H oxidation reactions
machine learning
large language models
batch screening

Supplementary materials

Supplementary information: General experimental, characterization data, spectra, and computational methods
SF1: Literature Screening Dataset
SF2: E-Chem Reaction Screening Dataset
SF3: Automated Coding Dataset
SF4: E-Chem Reaction Optimization Dataset
