Since the beginning of modern science, researchers have used standardized formats to communicate their findings. Such formats help ensure that results can be replicated and published. With the rise of digitization, artificial intelligence has become increasingly important in combination with the scientific literature as a source of data. This synergy provides a foundation for robust models built on the central principles of FAIR (Findable, Accessible, Interoperable, Reusable) data. With access to more precise data, it is reasonable to anticipate the development of improved models. Large neural networks in particular have proven highly sensitive to the quality of the data they are trained on, so enhancing data quality can potentially reduce the required network size. Large Language Models (LLMs) have proven remarkably effective at tasks previously performed by humans. This is a significant advance that not only automates processes but also yields better results. By combining human and LLM assistance, we can produce higher-quality content and solve repetitive tasks that would otherwise take years to complete. These generative AI assistants can follow instructions to transform and extrapolate existing text. Our contribution outlines a method for automatically extracting experimental data on molecules from the literature. First, through prompt engineering, we demonstrate that this process can be made more cost-effective. Second, we apply automated fact-checking principles to assess both the quality of the original data and the accuracy of the LLM's retrieval. Ultimately, our aim is to provide guidance for the publication of organic chemical experimental data, assisting researchers and enhancing FAIR data.
Automatic extraction of FAIR data from publications using LLM
04 December 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.