Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model

08 April 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and because of the vastness of the organic chemistry literature (papers, patents), converting unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we use fine-tuned large language models (LLMs) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), and it is able to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
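
To make the distinction between a structured record and a field-level score concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes a simplified, dictionary-based record loosely modeled on ORD compound fields (identifiers, amount, reaction role) and a hypothetical exact-match metric; the helper names `leaf_fields` and `field_accuracy` and the example values are illustrative assumptions, and the metric actually used in the study may differ.

```python
from typing import Any, Iterator, Tuple


def leaf_fields(record: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf value in a nested dict/list record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from leaf_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(record, list):
        for index, value in enumerate(record):
            yield from leaf_fields(value, f"{path}[{index}]")
    else:
        yield path, record


def field_accuracy(reference: Any, predicted: Any) -> float:
    """Fraction of reference leaf fields reproduced exactly (hypothetical metric)."""
    ref = dict(leaf_fields(reference))
    pred = dict(leaf_fields(predicted))
    if not ref:
        return 1.0
    correct = sum(1 for p, v in ref.items() if p in pred and pred[p] == v)
    return correct / len(ref)


# Simplified record, assumed for illustration only (not a full ORD message).
reference = {
    "inputs": {
        "m1": {
            "components": [
                {
                    "identifiers": [{"type": "NAME", "value": "benzaldehyde"}],
                    "amount": {"mass": {"value": 1.06, "units": "GRAM"}},
                    "reaction_role": "REACTANT",
                }
            ]
        }
    }
}

# A prediction that gets every field right except the reaction role.
predicted = {
    "inputs": {
        "m1": {
            "components": [
                {
                    "identifiers": [{"type": "NAME", "value": "benzaldehyde"}],
                    "amount": {"mass": {"value": 1.06, "units": "GRAM"}},
                    "reaction_role": "REAGENT",
                }
            ]
        }
    }
}

score = field_accuracy(reference, predicted)
```

In this toy case the reference has five leaf fields and the prediction misses one (the reaction role), so the field-level score is 4/5, whereas a stricter message-level comparison would count the whole compound definition as incorrect.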

Keywords

NLP
LLM
Open Reaction Database
Structured Data Extraction

Supplementary materials

Supporting Information for "Extracting Structured Data from Organic Synthesis Procedures Using a Fine-Tuned Large Language Model"
