DBgen: A Python Library for Defining Scalable, Maintainable, Accessible, Reconfigurable, Transparent (SMART) Data Pipelines

In this work, we present DBgen, a Python library that provides a framework for defining extract-transform-load (ETL) pipelines to create and populate SQL databases. DBgen is most useful when the underlying data has complex relationships, requires multi-step analysis, is large-scale, and the type of data being collected changes frequently. Scientific data often fits this description. With current tooling, defining ETL pipelines for this particularly difficult-to-manage data is so onerous that a great deal of it never ends up stored in a database and remains opaque. DBgen is designed to fill the gap in the current tooling and reduce the barrier to defining ETL pipelines for such data.

(SMART). In section 1.1, we describe how the SMART principles address the challenges in defining scientific ETL pipelines. Then, in section 1.2, we identify a gap in the current tooling which makes it difficult to define SMART ETL pipelines. In section 2, we describe what DBgen is and how it helps to fill this gap in the current tooling.

Maintainable: The fundamental relationships between the entities of interest in scientific data are very complex. There are many types of entities that need to be tracked, and the relations between these entities are oftentimes many-to-many. There is also a great deal of meta-data that needs to be stored and formally linked to the data. This makes the FAIR principles of rich meta-data and strong provenance challenging to achieve. A database architecture capable of capturing this complexity requires many tables and foreign keys. Accordingly, the ETL pipeline that populates such a database is a fairly complex piece of software. As with any complex piece of software, maintainability is crucial, and making the code modular and easy to debug is imperative for maintainability.

Generators: Each Generator defines a single extract-transform-load (ETL) step. It consists of a Query, a PyBlock, and Loads, which represent the extract, transform, and load steps, respectively. A common pattern in DBgen data pipelines is to query the database, use that information as inputs to a function, and insert the result back into the database. This allows for complicated, multi-step data analyses to be carried out in a flexible, modular, maintainable, and transparent manner.
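The query-transform-insert pattern described above can be sketched in plain Python with the standard-library `sqlite3` module. This is only an illustration of the pattern, not DBgen's own API; the table and function names here are hypothetical.

```python
import sqlite3

# Illustrative extract-transform-load step over a toy schema (names are
# hypothetical, not DBgen's API).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, raw_value REAL)")
conn.execute("CREATE TABLE result (sample_id INTEGER, scaled REAL)")
conn.executemany("INSERT INTO sample (raw_value) VALUES (?)", [(1.0,), (2.5,)])

# Extract: query the database for the inputs the transform needs.
rows = conn.execute("SELECT id, raw_value FROM sample").fetchall()

# Transform: an ordinary Python function applied to each queried row.
def scale(value: float) -> float:
    return value * 10.0

# Load: insert the transformed outputs back into the database.
conn.executemany(
    "INSERT INTO result (sample_id, scaled) VALUES (?, ?)",
    [(sample_id, scale(v)) for sample_id, v in rows],
)
```

Because each step is expressed as a query, a pure function, and an insert, the step can be tested and modified in isolation, which is what makes the pattern modular.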

Query: The Query object in DBgen defines a SQL query. By using a Query object rather than a raw SQL string, the Generator knows which database columns need to be queried in order for that ETL step to run.

DBgen will later use this information to compute dependencies among the ETL steps and automatically run them in the correct order. The inputs to the function come from the Query object, and the outputs from the function are inserted back into the database in the load step, which is described next. PyBlocks can also read data in from the file system, which is commonly used for the early ETL steps in the data pipeline.

The underlying data that is now in the ESAMP database was originally stored in a large zip archive containing additional zip archives containing tens of thousands of automatically generated, custom-structured text files.
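The automatic ordering described above can be illustrated with a small sketch: if each step declares which columns it reads and which it writes, a topological sort recovers a valid execution order. This is a hypothetical model of the mechanism, not DBgen's actual implementation, and the step and column names are invented.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL steps, each declaring the columns it reads and writes.
steps = {
    "parse_files":   {"reads": set(),           "writes": {"sample.raw"}},
    "compute_peaks": {"reads": {"sample.raw"},  "writes": {"peak.height"}},
    "fit_model":     {"reads": {"peak.height"}, "writes": {"model.params"}},
}

# A step depends on every other step that writes a column it reads.
deps = {
    name: {
        other
        for other, info in steps.items()
        if other != name and info["writes"] & spec["reads"]
    }
    for name, spec in steps.items()
}

# Topological order: producers run before their consumers.
order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['parse_files', 'compute_peaks', 'fit_model']
```

Declaring inputs and outputs explicitly, rather than hiding them inside raw SQL strings, is what makes this kind of dependency analysis possible.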

The data pipeline that extracts the data from that structure and populates

In this example, process data has been stored within the process data Entity

changes are made, such as the modification of the relationship between process data and peaks from a one-to-many to a many-to-many relationship,

This is not because it wouldn't be useful, but rather because it is difficult to achieve for the reasons mentioned in section 1. We believe that these databases do not exist today because the software tools needed to generate these complicated databases and corresponding ETL pipelines do not exist.

DBgen aims to fill that need.
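The kind of schema change mentioned above, turning a one-to-many relation between process data and peaks into a many-to-many relation, can be sketched in SQL via `sqlite3`. The table and column names here are illustrative, not taken from the ESAMP schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One-to-many: each peak points at exactly one process_data row via a
# foreign key column (illustrative names, not the ESAMP schema).
conn.execute("CREATE TABLE process_data (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE peak (id INTEGER PRIMARY KEY, "
    "process_data_id INTEGER REFERENCES process_data(id))"
)

# Many-to-many: drop the foreign key column and link through a junction
# table, so one peak can be shared by several process_data rows.
conn.execute("DROP TABLE peak")
conn.execute("CREATE TABLE peak (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE process_data_peak ("
    "process_data_id INTEGER REFERENCES process_data(id), "
    "peak_id INTEGER REFERENCES peak(id), "
    "PRIMARY KEY (process_data_id, peak_id))"
)
```

Propagating a change like this by hand means rewriting every query and insert that touched the old foreign key, which is exactly the maintenance burden a pipeline framework should absorb.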

In this work, we present DBgen, a Python library that adds useful abstractions to the process of defining complex databases and ETL pipelines and thereby reduces the barrier to storing data in accordance with the FAIR principles. We also present a set of principles (SMART) that ETL pipelines should ideally abide by, analogous to the FAIR principles for data storage. We use materials science R&D data as an example of an inherently complex data source. We show that modeling the data without making assumptions demands a complicated database architecture, which would be difficult to create without DBgen. We show that modeling the data in this way adds transparency and flexibility to dataset curation. Altogether, we provide evidence that DBgen is a useful tool that greatly reduces the barrier to storing scientific data in accordance with the widely accepted FAIR principles.