Quantifying the distribution of materials data types in scientific literature across text, tables, and figures

16 November 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Materials science research is a multifaceted field in which valuable data are scattered across research papers in various formats. Efficient extraction of these data is essential for further analysis and research. This study aims to shed light on how data are distributed in materials science papers and how the data types are interconnected. In this preliminary analysis, we systematically examined 10 randomly selected materials science papers to determine where four key data types (composition, processing conditions, characterization, and performance properties) reside within the text, tables, and figures. Our findings reveal patterns in how data are presented, ranging from conventional text-based descriptions to detailed tables and visually informative figures. The analysis covers diverse materials and highlights cases where data types are isolated in one source or interconnected across several. We also address the challenges and limitations encountered during the annotation process. This investigation underscores the importance of understanding data distribution within materials science papers, as it has direct implications for data accessibility and integration in the field. These insights also pave the way for future research, particularly the development of NLP models tailored to the characteristics of materials science papers, along with other machine learning techniques for more efficient data extraction and analysis.

Keywords

NLP
LLM
Materials Science
Data Extraction

Supplementary materials

Title: Supplementary Information
Description: A comprehensive and detailed breakdown of the data type distribution within the ten analyzed materials science papers. This complements the summarized data distribution table presented in the main body of the paper, offering a more exhaustive view of how data types are distributed across text, tables, and figures in each paper.
