Noise Analysis and Data Refinement for Chemical Reactions from US Patents via Large Language Models

30 October 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The extraction of chemical reactions from U.S. Patent and Trademark Office (USPTO) documents has enabled significant advances in machine learning models for organic synthesis. While the USPTO dataset offers a large and diverse collection of reaction data, recent studies have identified issues such as inconsistent or missing chemical entries that degrade data quality. To address these challenges, we used fine-tuned large language models (LLMs) to revisit the experimental sections of US patents and perform a comprehensive analysis of noisy reaction data. Our findings show that LLM-based extraction produces fewer false reactions than are found in existing datasets, and reveal that many reactions in US patents involve multiple experimental steps that standard extraction methods previously overlooked. Our analysis suggests that untraceable references and erroneous chemical names are the primary sources of data noise. We also identify reaction types that are highly susceptible to these issues and recommend that scientists avoid using reaction data of those high-risk types.

Keywords

Large language model
reaction dataset

Supplementary materials

Title: Supplementary Information
Description: Supplementary information for this study
