EnzymeML – a data exchange format for biocatalysis and enzymology

Abstract


Introduction
Enzyme catalysis and enzymology provide a powerful toolbox for sustainable synthesis routes and innovative solutions for bio-based chemistry. A better understanding of cellular biochemistry and the comprehensive biochemical characterization of the desired enzyme-catalyzed reaction enable novel approaches in enzyme engineering and process development 1 . Standardization of reporting of enzymatic data and metadata is considered to be pivotal to accelerating bioprocess development and reducing costs 2 , facilitating sharing, analysis, and reuse of data and thus enabling quality control and reproducibility of experiments 3 . Therefore, a major challenge for enzymology and biocatalysis lies in the current practices of dealing with experimental data in academic laboratories 4 . In most academic research groups, data acquisition, curation, and documentation are performed manually without a universally accepted standard across laboratories. Data and metadata are typically stored in ad hoc repositories, such as paper lab notebooks, spreadsheets in different formats, and semi-structured text files containing custom annotations. Experimental or computational data is often poorly annotated, lacking a complete description of the acquisition and analysis procedures, or associated metadata. Despite previous efforts to address these issues 5 , raw data are rarely available in machine-readable, and even less often in machine-actable, format, preventing their further analysis and third-party validation. As it stands, the process of data acquisition, data analysis, and documentation is time consuming and error-prone, as is the recovery and interpretation of legacy data in most academic laboratories. Consequently, both the quality and the completeness of data and metadata rely solely on the experimenter's expertise and care.
Meta-research studies identify the lack of standardization in reporting and sharing experimental protocols, results, and data as one of the causes of the reproducibility crisis in the biomedical sciences 6,7 . This is also true for enzymology and biocatalysis. An empirical analysis of published papers investigating enzyme function illustrates how information that is critical for the reproducibility of experimental findings is missing in the literature 8 ; the missing information includes the concentration of enzyme and/or substrates, the composition of the entire buffer systems including the identity of counter-ions, pH values, and assay temperatures.
The incompleteness of metadata prevents the interpretation of inconsistent data arising from different studies. An example of such variability is demonstrated in a large global benchmark study 9 , in which the variability of a dissociation constant for a protein-protein interaction determined by 150 participants using a general protocol exceeded its average value. When investigators were given detailed fixed protocols, the dissociation constants still varied by up to 20% 10,11 . This kind of irreproducibility is commonplace in enzymology and has a substantial impact on subsequent research.
In response to the reproducibility crisis, the scientific community is developing and adopting new guidelines for reporting experimental protocols and statistical analysis. Scientific journals are responding accordingly 12 , and there has been a recommendation to modify the academic reward system by recognising scientists who align with best practices for reproducible research 13 . Initiatives such as the German National Research Data Infrastructure develop an infrastructure for standardised research data exchange 14 , the Standards in Laboratory Automation consortium (SiLA) provides a framework for the exchange, integration, sharing, and retrieval of electronic laboratory information (https://sila2.gitlab.io/sila_base/), and data repositories such as Zenodo and Dataverse enable data sharing 15 . Efforts in standardization and data reproducibility have long been established in other 'omics fields, with standard exchange formats for transcriptomics 16 , proteomics 17 , and metabolomics 18 data becoming increasingly developed and adopted over the last twenty years. However, in biocatalysis and enzymology, exchange standards and software support to aid data analysis, management, and sharing are still absent, and raw experimental data such as the time dependency of substrate or product concentration, derived data such as kinetic parameters, and metadata such as reaction conditions or the kinetic model are typically reported in plain text, figures, or tables 19 .
Currently, kinetic parameters and corresponding information about the reactions, enzymes, and experimental conditions are extracted and annotated manually from scientific publications and inserted into databases such as SABIO-RK 20 or BRENDA 21 to structure and standardise the data. Missing information such as unambiguous external identifiers is added manually by database curators. As a first step for the standardised reporting of enzyme function data, the enzymology and biocatalysis community has established the Standards for Reporting Enzymology Data (STRENDA) Guidelines, which provide the minimum information necessary to describe assay conditions and enzyme activity data 22,23 . Currently, more than 55 international biochemistry journals have included adherence to the STRENDA Guidelines in their instructions for authors reporting enzymology data. STRENDA DB has been established as a public database to support authors checking the completeness of their data upon submission of their manuscript and to provide public access to data on reaction conditions and kinetic parameters of an experiment 24 . However, the upload of data is performed manually via a graphical user interface, and the process from data acquisition to kinetic modelling and publication is still time consuming and error prone. Most importantly, original data such as the measured time course of substrate and product concentrations is not reported or has to be extracted from figures, thus preventing the reuse of original data for kinetic modelling. Not only is published data incomplete and inaccessible, but also unpublished research data and metadata are stored by research group members with insufficient documentation and annotation. In addition, the current data management prevents researchers from upscaling their experimental designs to high-throughput biocatalytic approaches by using pipetting robots 25 or flow reactors 26 , and hinders the comprehensive study of the multidimensional parameter space of biocatalytic reactions.
Here, we introduce EnzymeML, a data exchange format for biocatalysis and enzymology, which makes enzyme data findable, accessible, interoperable, and reusable in accordance with the FAIR data principles 27 . An application programming interface (API) provides Python and Java libraries to integrate applications and databases and to enable a seamless data flow from the bench to kinetic modelling tools and publication platforms. The machine-actable EnzymeML document on data and metadata of an enzymatic reaction could serve as a micropublication, supplementing the respective scientific paper.

Principles of EnzymeML
EnzymeML has been designed to support data acquisition, data analysis, and sharing of data by providing a standardised exchange format for enzymatic data (Fig. 1). EnzymeML is written in eXtensible Markup Language (XML) and comprises the most relevant data and metadata from measurement and modelling. Given the ubiquity of XML, vast amounts of software are available that read, write, manipulate, and process XML documents. More importantly, XML allows for the specification of a machine-actable schema which ensures interoperability. The central core of EnzymeML is the Systems Biology Markup Language (SBML), an established data format in systems biology for sharing, evaluating, and developing models of biochemical reaction networks 28 . Interoperability with existing software tools and databases is achieved by applying a common terminology and vocabulary that allow the integration of data from various sources for subsequent processing, because many of the concepts supported by SBML (educts, products, reactions, modifiers, reaction rates) are common to enzymology and biocatalysis. However, EnzymeML goes beyond SBML, because it serves to describe the effect of enzyme sequence and reaction medium on an enzymatic reaction.
EnzymeML implements the STRENDA Guidelines: For the complete machine-actable description of an enzymatic experiment, the STRENDA Guidelines were incorporated. In addition, metadata on the experiments and the kinetic model were included, resulting in a comprehensive data exchange format that comprises 71 attributes (Tab. S1). The current version of EnzymeML includes all STRENDA fields with a controlled vocabulary or values and excludes fields with plain text such as experiment methodology, in order to make EnzymeML structured and machine actable.

EnzymeML was built within the framework of several internationally recognised standards:
SBML is a widely used XML-based markup language and describes almost 50% of the attributes (Tab. S1). MathML was applied to describe the equation of the kinetic model 28 , and the guidelines on Minimal Information Required in the Annotation of Models (MIRIAM) 29 were applied for the consistent annotation of components such as reactants, products, and enzymes, using terms from external data repositories such as ChEBI 30 and UniProt 31 . A controlled, relational vocabulary of terms, the Systems Biology Ontology (SBO) 32 , was used to define reactants, inhibitors, activators, parameters, and the kinetic model. All files are combined into a single document using the OMEX format 33 . Furthermore, EnzymeML uses the Distributions package for SBML Level 3 (http://sbml.org/Documents/Specifications/SBML_Level_3/Packages/distrib) to support the specification of ranges of initial concentrations.
EnzymeML is extensible: EnzymeML-specific attributes are added to SBML using the "annotation" element, which allows metadata specific to enzymology to be added to the XML document whilst maintaining compatibility with SBML. EnzymeML documents are valid SBML files and can therefore be used and manipulated by many software tools that support the SBML format.
EnzymeML is platform independent: XML has been designed to store and transfer data, and is fully agnostic to the operating system and supported by different programming languages.
Comma-Separated Values (CSV) is a platform-independent text file format, which was designed for storing and transporting data structured in tables. CSV-formatted files can be read by the modelling platform COPASI 34 and by spreadsheet editors such as Excel. All components of EnzymeML are self-descriptive (SBML, MathML, OMEX), which makes EnzymeML human readable and machine actable.
EnzymeML is modular: EnzymeML was developed as a container for experimental and modelling data, supporting a seamless data flow between different applications (Fig. 2). Data obtained from an experiment and metadata on experimental conditions can be stored by the experimentalist in a spreadsheet, which is convertible into EnzymeML using the API. Longer term, it is hoped that electronic lab notebooks, laboratory information management systems, and enzymology software will support the format. The EnzymeML document contains sufficient experimental data to allow for the estimation of the kinetic parameters by modelling platforms such as COPASI 34 , BioCatNet 35 , or Matlab™. Kinetic parameters can then be included in the EnzymeML document. As a consequence, enzyme assay data may be easily reanalyzed and checked with a range of data fitting algorithms, increasing reusability and confidence in both the experimental data and reported kinetic parameters.
EnzymeML enables data publication in compliance with FAIR principles: An EnzymeML document stores comprehensive information about data and metadata of an enzymatic experiment: the experimental conditions, the time course of substrate and product concentration, the kinetic model, and the estimated kinetic parameters, thus making the experiment and its analysis reproducible. Upon publication, it is recommended to use EnzymeML documents as supplementary material. By depositing EnzymeML documents on platforms such as FAIRDOMHub 36 or Dataverse 37 using a digital object identifier, EnzymeML documents are findable and accessible. EnzymeML documents also include references to the scientific publications from which they arose, providing contextual information.

Structure of EnzymeML documents
An EnzymeML document is a ZIP container in the widely used OMEX format 33 . It consists of three file types: a file using SBML to describe the experimental reaction conditions, the kinetic model, and the kinetic parameters; CSV (comma-separated values)-formatted files to store the time courses of substrate and product concentrations; and a manifest file that lists the content of the ZIP container (Figure 1).
The experimental conditions are reported according to the STRENDA recommendations, and the kinetic model is described by using MathML and SBML in the experiment file. This file also describes the format of the CSV-formatted file which contains the raw time course data. Instead of using headers to describe columns, the complete CSV-formatted file description is done within the SBML file. This approach has the advantage of enabling a comprehensive description of each column, such as measured species, units, and data types, instead of a single header. The SBML file uses two elements, notes and annotation. A notes tag contains human-readable information as plain text, whereas an annotation tag contains structured, machine-actable information. Notes and annotation tags are used to add information which is required by the STRENDA Guidelines, but not included in SBML, such as protein sequence, pH, or temperature. Thus, this file is a valid SBML document, which contains additional information on enzyme-catalyzed reactions. An extensive description of the EnzymeML document structure is available in the Supporting Information.
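The container layout described above can be illustrated with a minimal Python sketch that uses only the standard library. It is not taken from the EnzymeML specification: the file names (experiment.xml, data/m0.csv), the annotation namespace http://example.org/enzymeml-sketch, its element and attribute names, and all numeric values are hypothetical placeholders, and the format URIs in the manifest follow the OMEX convention as far as we are aware and should be checked against the OMEX specification. The SBML fragment is deliberately incomplete and has not been validated.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Hypothetical namespace for the enzymology-specific annotation (assumption).
SKETCH_NS = "http://example.org/enzymeml-sketch"

# Manifest listing the content of the ZIP container.
MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="./manifest.xml" format="http://identifiers.org/combine.specifications/omex-manifest"/>
  <content location="./experiment.xml" format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="./data/m0.csv" format="http://purl.org/NET/mediatypes/text/csv"/>
</omexManifest>
"""

# SBML-like experiment file: reaction conditions and the description of the
# headerless CSV columns live inside an annotation element.
EXPERIMENT = f"""<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version2/core" level="3" version="2">
  <model id="example_experiment">
    <annotation>
      <ex:conditions xmlns:ex="{SKETCH_NS}">
        <ex:ph value="7.4"/>
        <ex:temperature value="37" unit="celsius"/>
        <ex:column file="./data/m0.csv" index="0" type="time" unit="second"/>
        <ex:column file="./data/m0.csv" index="1" type="concentration" species="s0" unit="mmole / l"/>
      </ex:conditions>
    </annotation>
  </model>
</sbml>
"""

# Time course without a header row; made-up values (time, substrate concentration).
TIME_COURSE = "0,0.50\n10,0.46\n20,0.42\n"


def build_archive() -> bytes:
    """Pack manifest, experiment file, and time course into one ZIP container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.xml", MANIFEST)
        archive.writestr("experiment.xml", EXPERIMENT)
        archive.writestr("data/m0.csv", TIME_COURSE)
    return buffer.getvalue()


def list_archive(data: bytes) -> list[str]:
    """Return the names of all files stored in the container."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def read_ph(data: bytes) -> float:
    """Read the pH back from the annotation of the experiment file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("experiment.xml"))
    ph_element = root.find(f".//{{{SKETCH_NS}}}ph")
    return float(ph_element.get("value"))


if __name__ == "__main__":
    container = build_archive()
    print(list_archive(container))
    print(read_ph(container))
```

Because the column description sits in the experiment file rather than in a CSV header, a reader can resolve the species and unit of every column before opening the data file.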

EnzymeML application programming interface (API)
Although EnzymeML is semi-human-readable, the user is not expected to read or write EnzymeML documents directly, but to use software to generate EnzymeML documents, which can then be used as a standardised exchange format to transfer data between applications (Figure 2). APIs to read, write, edit, and visualise EnzymeML have therefore been developed, using the popular programming languages Python and Java, to support the development of such software tools. The library PyEnzyme was built based on its respective SBML counterpart libSBML. To simplify the implementation of the libraries for enzyme-catalyzed reactions, the terminology of enzymology and biocatalysis is used, hiding the more systems biology focused SBML terms, while maintaining full compatibility with the SBML format.
The adaptation of the API to an application is enabled by an additional thin layer, which maps the objects of the API to the equivalent objects defined within the respective application. Thus, by editing a template, the functionality of reading and writing of EnzymeML can be easily incorporated into an application without the need to modify the API. For five applications (COPASI import/export, STRENDA-DB export, BioCatNet export, SABIO-RK import, simulation of time course data), application-specific thin API layers are provided (TL_COPASI, TL_STRENDAML and TL_BioCatNet, respectively). Because the API enables batch processing, management of enzymatic data is scalable, and high throughput strategies of experimentation and data analysis become feasible. By data export in formats such as Pandas DataFrame, large datasets could be analyzed by novel analysis methods based on machine learning.
Upon reading, writing, and visualization of EnzymeML documents, the API controls data completeness and consistency, for example by checking the definition of reactants and proteins upon reading or writing of a reaction, or by checking that scalar properties such as pH are within the necessary range. A specific validation tool guarantees compatibility with SBML. Further application-specific validation tools have been added, such as a STRENDA DB validator to check for compatibility with the STRENDA Guidelines. For more details, readers can find a description of the API below and in the Supporting Information.
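The thin-layer idea can be sketched in a few lines of Python. The sketch below is an illustration only and does not reproduce PyEnzyme or any of the provided layers: the classes Species and Experiment, the mapping template FIELD_MAP, and the application field names ("Title", "pH", "Temp [C]", "Species", "Name", "c0", "Unit") are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Species:
    """Hypothetical object of the shared object layer."""
    name: str
    init_conc: float
    unit: str


@dataclass
class Experiment:
    """Hypothetical object of the shared object layer."""
    name: str
    ph: float
    temperature_c: float
    species: list[Species] = field(default_factory=list)


# Template of the thin layer: application field name -> attribute of the
# shared object. Supporting another application means editing only this part.
FIELD_MAP = {"Title": "name", "pH": "ph", "Temp [C]": "temperature_c"}
CONVERTERS = {"name": str, "ph": float, "temperature_c": float}


def from_application(record: dict) -> Experiment:
    """Map one application-specific record to the shared object layer."""
    kwargs = {
        attr: CONVERTERS[attr](record[key]) for key, attr in FIELD_MAP.items()
    }
    species = [
        Species(entry["Name"], float(entry["c0"]), entry["Unit"])
        for entry in record.get("Species", [])
    ]
    return Experiment(species=species, **kwargs)


def batch_convert(records: list[dict]) -> list[Experiment]:
    """Convert many records in one call (batch processing)."""
    return [from_application(record) for record in records]


if __name__ == "__main__":
    row = {
        "Title": "assay 1",
        "pH": "7.0",
        "Temp [C]": "25",
        "Species": [{"Name": "substrate", "c0": "0.5", "Unit": "mM"}],
    }
    print(from_application(row))
```

In this design, the object layer and the code that writes the exchange format remain untouched when a new application is supported; only the mapping template changes.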
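The kind of completeness and consistency check described above can be sketched as follows. This is a simplified illustration and not the validation code of the API: the dictionary layout, the function name validate_experiment, and the accepted pH range of 0 to 14 are assumptions made for this example.

```python
def validate_experiment(experiment: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []

    # Scalar property check: pH has to be present and within an assumed range.
    ph = experiment.get("ph")
    if ph is None:
        problems.append("pH is missing")
    elif not 0.0 <= ph <= 14.0:
        problems.append(f"pH {ph} is outside the range 0 to 14")

    # Consistency check: every reaction participant has to be defined
    # as a reactant or as a protein before it is used in a reaction.
    defined = set(experiment.get("reactants", [])) | set(experiment.get("proteins", []))
    for reaction in experiment.get("reactions", []):
        for participant in reaction["participants"]:
            if participant not in defined:
                problems.append(
                    f"reaction {reaction['id']} uses undefined species {participant}"
                )
    return problems


if __name__ == "__main__":
    example = {
        "ph": 7.4,
        "reactants": ["s0", "s1"],
        "proteins": ["p0"],
        "reactions": [{"id": "r0", "participants": ["s0", "s1", "p0"]}],
    }
    print(validate_experiment(example))
```

Returning all problems at once, rather than raising an error at the first one, lets an application show the user a complete list of missing or inconsistent entries.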

Application of EnzymeML
To illustrate the power of EnzymeML, we present selected applications for experimental enzymologists, systems biology modelers, and software developers.

Creating EnzymeML documents from structured spreadsheets
In the absence of a standard format, experimentalists typically store their experimental time course data in a spreadsheet following an ad hoc structure. Recently, a CSV-formatted spreadsheet, the BioCatNet template 35 , was proposed to store and report experimental data on enzyme-catalyzed reactions according to the STRENDA Guidelines. The API was used to convert the BioCatNet spreadsheet, containing time course data on substrate and product concentration and comprehensive information on the reaction conditions, to EnzymeML.
Initially, each field of the respective spreadsheet template was extracted via a thin API layer (TL_BioCatNet) and further processed by the API to an object layer. Finally, the objects were written to an EnzymeML document (see SI 3.1).

Creating EnzymeML documents from STRENDA DB entries
STRENDA DB is a database on enzyme-catalyzed reactions, which covers the most important information on reaction conditions and kinetic parameters 24 . The API was used to create an EnzymeML document from a STRENDA DB entry via a STRENDA DB-specific thin API layer (TL_STRENDA) to the object layer using the PyEnzyme library. The resulting EnzymeML document was then created by the API (see SI 3.2).

Upload of EnzymeML documents to SABIO-RK
SABIO-RK is a curated database that contains information about biochemical reactions, their kinetic rate equations with parameters, and experimental conditions 20 . An already existing SBML parser for the upload of SBML models in SABIO-RK was extended to read the additional annotations in EnzymeML to allow the import of EnzymeML documents and to create a new SABIO-RK entry in the internal curation interface (see SI 3.3). SABIO-RK curators check the new SABIO-RK entries for consistency and completeness according to the SABIO-RK requirements before they are finally submitted to the public SABIO-RK database.

Editing of EnzymeML: simulation of time course data from kinetic parameters
For an enzyme-catalyzed reaction, STRENDA-DB entries provide the kinetic parameters KM and kcat, assuming a Michaelis-Menten model, and the concentration range of the substrate. However, they lack information on the product and on the time course of substrate or product concentrations. PyEnzyme was used to add the product and time course data to the EnzymeML document (see SI 3.4). By a single function in the API, the time course of substrate concentrations was simulated from the kinetic parameters for initial concentrations from 0 to 0.5 mM for a time interval of 200 seconds to visualise kinetic behavior and study the effect of kinetic parameters.
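Such a simulation amounts to integrating the Michaelis-Menten rate law, d[S]/dt = -kcat [E] [S] / (KM + [S]), over time. The following Python sketch shows one way to do this with a fixed-step fourth-order Runge-Kutta scheme; it is not the function provided by the API. The function names and the values chosen for kcat, KM, and the enzyme concentration are hypothetical; only the initial concentration range (0 to 0.5 mM) and the time interval (200 seconds) are taken from the text.

```python
def mm_rate(s: float, kcat: float, km: float, e0: float) -> float:
    """Michaelis-Menten rate of change of the substrate concentration."""
    return -kcat * e0 * s / (km + s)


def simulate_time_course(s0, kcat, km, e0, t_end=200.0, steps=200):
    """Integrate the substrate concentration with a fixed-step RK4 scheme.

    Returns two lists: time points (s) and substrate concentrations (mM).
    """
    dt = t_end / steps
    times, conc = [0.0], [s0]
    s = s0
    for i in range(1, steps + 1):
        k1 = mm_rate(s, kcat, km, e0)
        k2 = mm_rate(s + 0.5 * dt * k1, kcat, km, e0)
        k3 = mm_rate(s + 0.5 * dt * k2, kcat, km, e0)
        k4 = mm_rate(s + dt * k3, kcat, km, e0)
        # Clamp at zero to guard against round-off below zero.
        s = max(s + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0, 0.0)
        times.append(i * dt)
        conc.append(s)
    return times, conc


if __name__ == "__main__":
    # Hypothetical parameters: kcat in 1/s, KM and enzyme concentration in mM.
    KCAT, KM, E0 = 10.0, 0.1, 1e-4
    for s0 in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        times, conc = simulate_time_course(s0, KCAT, KM, E0)
        print(f"S0 = {s0:.1f} mM -> S({times[-1]:.0f} s) = {conc[-1]:.4f} mM")
```

Varying KCAT or KM in the last block and comparing the resulting curves is a simple way to study how each parameter shapes the time course.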

Kinetic modelling of EnzymeML data by COPASI
COPASI is a modelling and simulation environment, which supports the OMEX format 34 . Using the PyEnzyme library and a COPASI-specific thin API layer (TL_COPASI), the time course data (measured concentrations of substrate or product) are loaded into COPASI. Within COPASI, different kinetic laws are applied, kinetic parameters are estimated, and plots are generated to assess the result. The selected kinetic model and the estimated kinetic parameters are then added to the EnzymeML document (see SI 3.5).
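To make the idea of estimating kinetic parameters from a time course concrete, the sketch below fits the integrated Michaelis-Menten equation, t = (S0 - S)/Vmax + (KM/Vmax) ln(S0/S), which is linear in 1/Vmax and KM/Vmax and can therefore be solved by ordinary least squares. This is a deliberately simple stand-in and not the estimation procedure used by COPASI; the function names and the synthetic, noise-free data are assumptions made for this example. With noisy measurements, a nonlinear fit as offered by dedicated modelling platforms is generally preferable.

```python
import math


def integrated_mm_time(s, s0, vmax, km):
    """Time at which the substrate concentration has dropped from s0 to s."""
    return (s0 - s) / vmax + (km / vmax) * math.log(s0 / s)


def fit_progress_curve(times, conc, s0):
    """Estimate (vmax, km) from one substrate time course.

    Solves the 2x2 normal equations of t = a*(s0 - s) + b*ln(s0/s),
    where a = 1/vmax and b = km/vmax. Requires 0 < s <= s0 for all points.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t, s in zip(times, conc):
        x1 = s0 - s
        x2 = math.log(s0 / s)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * t
        r2 += x2 * t
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    vmax = 1.0 / a
    km = b / a
    return vmax, km


if __name__ == "__main__":
    # Synthetic, noise-free data generated from hypothetical parameters.
    TRUE_VMAX, TRUE_KM, S0 = 1e-3, 0.1, 0.5  # mM/s, mM, mM
    conc = [S0 - 0.05 * i for i in range(1, 10)]
    times = [integrated_mm_time(s, S0, TRUE_VMAX, TRUE_KM) for s in conc]
    vmax, km = fit_progress_curve(times, conc, S0)
    print(f"vmax = {vmax:.6g} mM/s, km = {km:.6g} mM")
```

The fitted Vmax corresponds to kcat multiplied by the enzyme concentration, so kcat follows once the enzyme concentration is known.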

Outlook
For many years, researchers worldwide from various disciplines have recognised that data published in the literature is not reliable unless the full set of information required is provided 23 .
Therefore, the FAIR principles were introduced to encourage the comprehensive documentation of structured metadata in all stages of their life cycle in order to guarantee reproducibility of experiments and to enable reuse of results. A discipline-specific standard data exchange format such as EnzymeML therefore provides three functionalities to optimise research in biocatalysis and enzymology: it allows the experimentalist to collect data and metadata in a structured format for data analysis; it allows project partners to transfer data and metadata between different sites and different applications; and it enables findable and reusable publication and archiving of data and metadata 38 .
Currently, data flow from laboratory to publication is a challenging and complex process involving diverse processing stages, and numerous steps of data reformatting and manual input.
Such manual approaches are becoming increasingly unsustainable, especially in light of recent advances in miniaturization and robotics which have enabled the intensive, high-throughput screening of enzymes and process conditions 39 . Such technological advances foster the discovery of novel enzymatic systems and the (retro-)synthetic design of enzyme-catalyzed reaction cascades through integration of systematic data acquisition, data analysis, and simulation 40 . In a fully digitalised biocatalytic laboratory, an electronic lab notebook supports researchers at the bench to plan experiments and to collect experimental data and metadata 41,42 , all laboratory devices are connected by a common standard 43 , various modelling and data analysis tools are combined to analyze the data 34,35,44 , and the results are uploaded to searchable repositories without manual intervention 24,20 .
With the integration of EnzymeML the interoperability and compatibility of the tools and databases will be improved, and possible current limitations and inconsistencies in the data models of the repositories will be resolved. In the future, EnzymeML will be combined with other standards to enrich the data model and to connect disciplines that are relevant to enzymology. Incorporating AniML 43 or SiLA enables access to laboratory devices, and ThermoML 42 offers a comprehensive description of the reaction medium.
The introduction of EnzymeML as a uniform transport container for experimental data and metadata will encourage the development of software infrastructure built on this standardised format to greatly simplify the process of analyzing and publishing enzymology data, supporting the increasing experimental throughput, and ultimately promoting the digitalization of the fields of enzymology and biocatalysis 14 .