Abstract
In this work, we show that combining Retrieval Augmented Generation (RAG) with a Large Language Model (LLM) makes it possible to extract accurate data from scientific literature and to construct datasets from it. The pipeline is simple, transferable to other scientific domains, and automates accurate structured data extraction. Quantization allows the LLMs to run on consumer hardware. With RAG, both Llama3-8B and Gemma2-9B produce structured output consistently and with higher accuracy than direct prompting. Using the newly developed protocol, we created a dataset of metal hydrides for solid-state hydrogen storage. The accuracy obtained was >93% in the cases tested. We thus demonstrate a pipeline that creates datasets from scientific literature with high accuracy and at minimal computational cost.
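As a rough illustration of the retrieve-then-prompt idea summarized above, the following minimal sketch ranks text chunks against a question and assembles a prompt that asks for structured output. It is not the pipeline used in this work: the bag-of-words similarity stands in for whatever retriever is actually used, the function names (`retrieve`, `build_prompt`), the example chunks, and the JSON field name are all hypothetical, and the call to the LLM itself is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (assumed retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below. "
        'Reply as JSON with the key "value" (hypothetical schema).\n'
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Placeholder chunks written for this sketch, not taken from any paper.
chunks = [
    "The samples were prepared by ball milling under argon.",
    "The hydrogen storage capacity of the hydride was measured by volumetric methods.",
    "We thank the funding agency for support.",
]
query = "What is the hydrogen storage capacity?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

The resulting `prompt` string would then be passed to a locally hosted, quantized model. Restricting the context to the retrieved chunks is what distinguishes this from direct prompting with the full document.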
Supplementary weblinks
Title: Notebooks and datasets
Description: This repository contains the notebooks for the prompting and RAG methods, along with the datasets created in this work.