Automatic extraction of FAIR data from publications using LLM

04 December 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.


Since the beginning of modern science, researchers have used a specific format to communicate their findings in a standardized language. Such formats help to ensure that results can be replicated and published. With the rise of digitalization, artificial intelligence has become increasingly important in combination with the scientific literature as a source of data. This synergy serves as a foundation for robust models that follow the central principles of FAIR (Findable, Accessible, Interoperable, Reusable) data. With access to more precise data, it is reasonable to anticipate improved models. In particular, large neural networks have been shown to be highly sensitive to the quality of the data used. Enhancing data quality can therefore potentially reduce the size of the neural networks required. Large Language Models (LLMs) have proven to be very effective at replicating human tasks. This is a significant improvement that not only automates processes but also leads to better results. By combining human and LLM assistance, we can produce higher-quality content and solve repetitive tasks that would otherwise take years to complete. These generative AI assistants can follow instructions to transform and extrapolate existing text. Our contribution outlines a method for automatically extracting experimental data on molecules from the literature. First, we demonstrate that prompt engineering can make this process more cost-effective. Second, we use automated fact-checking principles to verify both the quality of the original data and the data retrieved by the LLM. Ultimately, our aim is to provide guidance for the publication of organic chemical experimental data to assist researchers and enhance FAIR data.


FAIR data
experimental data
Prompt engineering
fact checking
Publishing experimental review
