Abstract
The extraction of chemical reactions from U.S. Patent and Trademark Office (USPTO) documents has enabled significant advances in machine learning models for organic synthesis. Although the USPTO dataset offers a large and diverse collection of reaction data, recent studies have identified issues, such as inconsistent or missing chemical entries, that degrade data quality. To address these challenges, we used fine-tuned large language models (LLMs) to revisit the experimental sections of U.S. patents and performed a comprehensive analysis of noisy reaction data. Our findings demonstrate that LLMs produce fewer false reactions than are found in existing datasets and reveal that many reactions in U.S. patents involve multiple experimental steps, a feature previously overlooked by standard extraction methods. Our analysis suggests that untraceable references and erroneous chemical names are the primary sources of data noise. We also identify reaction types that are highly susceptible to these issues and recommend that scientists avoid using reaction data of these high-risk types.
Supplementary materials
Supplementary Information: Supplementary information for this study