Abstract
Artificial intelligence (AI) is accelerating the discovery and design of catalysts, and a critical first step is transforming the vast, unstructured synthesis literature into machine-actionable knowledge. Rule-based pipelines and early large language model (LLM) workflows are often constrained by short context windows and by limited coverage of multi-section and cross-referenced data, which are important for catalyst design and operation. We introduce CATDA (Corpus-aware Automated Text-to-Graph Catalyst Discovery Agent), an integrated framework that applies LLM reasoning to both structured dataset curation and knowledge discovery. At its core is CatGraph, a unified graph representation that captures multistep synthesis pathways, precursor properties, catalytic performance, and experimental conditions, extracted from full-text documents using the longest context window to date. Two downstream agents are built on this structured graph: CatAgent, which enables natural-language queries, and DatasetAgent, which exports machine-learning-ready datasets. On our benchmark, the extracted datasets achieve near-human fidelity (F1 > 0.97), demonstrating the accuracy of our pipeline. Together, these tools form an integrated system that connects literature mining, knowledge discovery, and predictive modeling, supporting data-driven catalyst design and synthesis operation optimization for next-generation inorganic catalysts.
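To make the text-to-graph-to-dataset idea concrete, the following minimal Python sketch shows one way a synthesis graph could be stored and flattened into machine-learning-ready rows. It is an illustration under stated assumptions, not the CATDA implementation: the class names (`Node`, `SynthesisGraph`), the relation labels (`input_to`, `step_of`, `has_performance`), the field names, and all values are hypothetical placeholders that do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str  # assumed kinds: "precursor", "step", "catalyst", "performance"
    props: dict = field(default_factory=dict)


@dataclass
class SynthesisGraph:
    """Hypothetical property graph: typed nodes plus labeled, directed edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node_id, kind, **props):
        self.nodes[node_id] = Node(node_id, kind, props)

    def add_edge(self, source_id, relation, target_id):
        self.edges.append((source_id, relation, target_id))

    def sources(self, target_id, relation):
        """Nodes that point at target_id through the given relation."""
        return [self.nodes[s] for s, r, t in self.edges
                if t == target_id and r == relation]

    def targets(self, source_id, relation):
        """Nodes that source_id points at through the given relation."""
        return [self.nodes[t] for s, r, t in self.edges
                if s == source_id and r == relation]


def export_rows(graph):
    """Flatten each catalyst node into one tabular record."""
    rows = []
    for node in graph.nodes.values():
        if node.kind != "catalyst":
            continue
        # Order the synthesis steps by their recorded position in the route.
        steps = sorted(graph.sources(node.node_id, "step_of"),
                       key=lambda step: step.props["order"])
        precursors = [p.props["name"]
                      for step in steps
                      for p in graph.sources(step.node_id, "input_to")]
        temperatures = [s.props["temperature_C"] for s in steps
                        if "temperature_C" in s.props]
        row = {
            "catalyst": node.node_id,
            "route": [s.props["name"] for s in steps],
            "precursors": precursors,
            "max_temperature_C": max(temperatures, default=None),
        }
        # Attach performance metrics as additional columns.
        for perf in graph.targets(node.node_id, "has_performance"):
            row.update(perf.props)
        rows.append(row)
    return rows


# Placeholder example; every value below is invented for illustration.
graph = SynthesisGraph()
graph.add_node("p1", "precursor", name="metal nitrate (placeholder)")
graph.add_node("s1", "step", name="impregnation", order=1, temperature_C=25)
graph.add_node("s2", "step", name="calcination", order=2, temperature_C=500)
graph.add_node("cat_A", "catalyst")
graph.add_node("perf_A", "performance", conversion_pct=80.0)
graph.add_edge("p1", "input_to", "s1")
graph.add_edge("s1", "step_of", "cat_A")
graph.add_edge("s2", "step_of", "cat_A")
graph.add_edge("cat_A", "has_performance", "perf_A")

rows = export_rows(graph)
```

Storing steps, precursors, and performance as separate nodes keeps cross-referenced details attached to the entity they describe, while the export step decides which of them become columns for a given modeling task.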
Supplementary materials
Supporting Information