Abstract
Despite the vast number of enzymatic kinetic measurements reported across decades of biochemical literature, the majority of relational enzyme kinetic data (linking amino acid sequence, substrate identity, kinetic parameters, and assay conditions) remains uncollected and inaccessible in structured form. This constitutes a significant portion of the “dark matter” of enzymology. Unlocking these hidden data through automated extraction offers an opportunity to expand the diversity and size of enzyme datasets, which is critical for building accurate, generalizable models that drive predictive enzyme engineering. To address this limitation, we built EnzyExtract, a large language model-powered pipeline that automates the extraction, verification, and structuring of enzyme kinetics data from scientific literature. By processing 137,892 full-text publications (PDF/XML), EnzyExtract curated more than 218,810 enzyme–substrate–kinetics entries, including 218,770 kcat and 169,459 KM values. EnzyExtract identified 94,576 unique kinetic entries (kcat and KM combined) absent from BRENDA, significantly expanding the known enzymology dataset. The newly curated dataset was compiled into a database named EnzyExtractDB. EnzyExtract demonstrates high accuracy when benchmarked against manually curated datasets and strong consistency with BRENDA-derived data. To create model-ready datasets, enzyme sequences and substrates were aligned to UniProt and PubChem, respectively, yielding 85,980 high-confidence, sequence-mapped kinetic entries. To assess the practical utility of our dataset, we retrained several state-of-the-art kcat predictors (including MESI, DLKcat, and TurNuP) using EnzyExtractDB. Across held-out test sets, all models demonstrate improved predictive performance in terms of RMSE, MAE, and R², highlighting the value of EnzyExtractDB as a high-quality, large-scale, literature-derived resource for enhancing predictive modeling of enzyme kinetics.
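For readers unfamiliar with the three evaluation metrics named above, the following is a minimal sketch of how RMSE, MAE, and R² can be computed for a set of predictions. It is not the authors' evaluation code: the function name `regression_metrics`, the example values, and the use of log10-transformed kcat as the regression target are all assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for paired lists of true and predicted values.

    Hypothetical helper for illustration; not part of EnzyExtract.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2.
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n

    # Total sum of squares around the mean of the true values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Made-up values, assumed here to be log10(kcat) targets and predictions.
true_log_kcat = [0.0, 1.0, 2.0, 3.0]
pred_log_kcat = [0.5, 1.0, 1.5, 3.0]
rmse, mae, r2 = regression_metrics(true_log_kcat, pred_log_kcat)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE and higher R² indicate better agreement between predicted and measured values, which is the sense in which the retrained models are described as improved.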
Supplementary materials
Title
SI Finding the Dark Matter: Large Language Model-based Enzyme Kinetic Data Extractor and Its Validation
Description
Supporting Information. Table S1 reports the performance benchmarking of EnzyExtract. Text S1 lists the keywords used for literature search, and Text S2 describes the prompt template employed in EnzyExtract. Text S3 provides details on the extraction and comparison methods, while Text S4 outlines the dataset splitting strategies used in TurNuP, MESI, and DLKcat. Figure S1 shows the temperature and pH distribution in EnzyExtractDB; Figure S2 presents the distribution of kcat and KM values; Figure S3 illustrates the Morgan fingerprint diversity; and Figure S4 highlights the mutation diversity across the database.