Abstract
Plants exhibit complex chemodiversity and represent a reservoir of potential new therapeutic agents. Within a Swiss research project, six research groups from different disciplines are collaborating to investigate a collection of more than 17,000 unique dried plant extracts. The project aims to find new bioactive molecules and their modes of action, with, for example, anti-infective or pro-metabolic activities.
One of the main challenges of this enterprise is the management, integration and sharing of the highly heterogeneous data produced by the different research groups. These include (i) massive high-resolution mass spectrometry data, (ii) the numerical results of innovative chemo-informatics methods, (iii) bioassay results from experimental models of tuberculosis and obesity, and (iv) data from synthetic organic chemistry. Additionally, the requirements for a data management plan and for open science following the FAIR principles must be met.
We have established an agile pipeline to capture and structure these heterogeneous data into an RDF graph. The gradual expansion and evolution of the data content throughout the project presented considerable challenges, particularly in terms of data modeling. Although many collaborators were not RDF experts, most were technically adept enough to produce the RDF triples relevant to their own contributions.
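To make the idea of contributors producing their own RDF triples more concrete, the following minimal sketch turns a flat record into N-Triples lines using only the Python standard library. It is an illustration under stated assumptions, not the project's pipeline: the `https://example.org/` namespace, the property names, the helper functions and the sample values are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical namespace for illustration; not the project's actual vocabulary.
BASE = "https://example.org/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def iri(local_name: str) -> str:
    """Wrap a local name in an absolute IRI, percent-encoding unsafe characters."""
    return f"<{BASE}{quote(local_name)}>"


def literal(value) -> str:
    """Serialize a Python value as an N-Triples literal."""
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, float):
        return f'"{value}"^^<{XSD}double>'
    # Plain string: escape backslashes and double quotes.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_ntriples(subject_id: str, record: dict) -> list:
    """Emit one triple per key/value pair of the record."""
    subject = iri(subject_id)
    return [f"{subject} {iri(key)} {literal(value)} ." for key, value in record.items()]


# Made-up sample record standing in for one extract's bioassay result.
extract = {"species": "Arnica montana", "inhibition_percent": 42.5}
triples = record_to_ntriples("extract/0001", extract)
print("\n".join(triples))
```

A line-based format such as N-Triples is convenient in this setting because each group can generate and concatenate files independently before they are loaded into a triplestore.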
We have deployed multiple instances of a triplestore and developed an in-house tool, KGSteward, to synchronize their content based on a configuration file that is centrally managed and version-controlled using Git. This strategy gave us the flexibility required to effectively address the project-wide challenges of shared data management.
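The general idea of configuration-driven synchronization can be sketched as follows. This is not KGSteward's actual configuration format or logic, which are not described here; the `CONFIG` structure, the `version` field and the `plan_sync` function are hypothetical and serve only to show how a central, version-controlled description of datasets could determine what each triplestore instance needs to reload.

```python
# Hypothetical configuration: one entry per named graph, with a version tag.
# In practice such a description could live in a Git-tracked file.
CONFIG = {
    "datasets": [
        {"graph": "https://example.org/graph/ms", "source": "ms.ttl", "version": "3"},
        {"graph": "https://example.org/graph/bioassay", "source": "bioassay.ttl", "version": "1"},
    ]
}


def plan_sync(config: dict, loaded_versions: dict) -> list:
    """Return the named graphs that are missing or outdated on one instance.

    `loaded_versions` maps graph IRI -> version currently loaded there.
    """
    to_load = []
    for dataset in config["datasets"]:
        if loaded_versions.get(dataset["graph"]) != dataset["version"]:
            to_load.append(dataset["graph"])
    return to_load


# Assumed state of one instance: an older mass spectrometry graph, no bioassay graph.
instance_state = {"https://example.org/graph/ms": "2"}
pending = plan_sync(CONFIG, instance_state)
print(pending)
```

Because the plan is derived from the configuration alone, running the same comparison against every instance would bring them to the same content without tracking per-instance history.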