Automatic extraction of FAIR data from publications using LLM

Luc Patiny; Guillaume Godin

doi:10.26434/chemrxiv-2023-05v1b-v2

Organic Chemistry

04 December 2023, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Since the beginning of modern science, researchers have used a specific format to communicate their findings in a standardized language. Such formats help to ensure that results can be replicated and published. With the rise of digitalization, artificial intelligence has become increasingly important when combined with the scientific literature as a source of data. This synergy serves as a foundation for robust models that follow the central principles of FAIR (Findable, Accessible, Interoperable, Reusable) data. With access to more precise data, it is reasonable to anticipate the development of improved models. Specifically, large neural networks have proven highly sensitive to the quality of the data used. Therefore, enhancing data quality can potentially lead to a reduction in the size of neural networks. Large Language Models (LLMs) have proven to be highly effective at replicating human tasks. This is a significant improvement that not only automates processes but also leads to better results. By combining human and LLM assistance, we can produce higher-quality content and solve repetitive tasks that would otherwise take years to complete. These generative AI assistants can follow instructions to transform and extrapolate existing text. Our contribution outlines a method for automatically extracting experimental data on molecules from the literature. First, through prompt engineering, we demonstrate that this process can be made more cost-effective. Second, we use automated fact-checking principles to verify both the quality of the original data and the data retrieved by the LLM. Ultimately, our aim is to provide guidance for the publication of organic chemistry experimental data to assist researchers and enhance FAIR data.

Keywords

Publishing experimental review

Supplementary weblinks

Title

visualizer: FAIR data extraction and fact check

Description

This allows users to explore and analyse the LLM process used to provide FAIR data extracted from the journal Molecules. It includes all data, logs, and metrics provided in this paper.

Version History

Dec 04, 2023 Version 2

Nov 17, 2023 Version 1

Version Notes

We have added a more detailed analysis and run the complete Molecules journal extraction, as well as a random call analysis. Close to 100 extractions were done.

Metrics

Views: 4,427

Downloads: 2,525

License

The content is available under CC BY NC ND 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
