AlchemBERT: Exploring Lightweight Language Models for Materials Informatics

11 December 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The advent of large language models (LLMs) has spurred numerous applications across domains, including materials design. However, effective materials design is unattainable without accurate property prediction: generating candidate structures is futile if their quality cannot be reliably evaluated. Matbench provides an excellent foundation for such predictive tasks, yet prior LLM efforts have focused primarily on composition-based tasks using models such as GPT or LLaMA. In this study, we revisit BERT, a lightweight language model with 110 million parameters, far smaller than GPT or LLaMA models, which contain billions. Remarkably, we demonstrate that BERT-base achieves performance comparable to these larger models on material property prediction tasks. Extending beyond composition, we apply BERT to structure-based prediction using CIF (Crystallographic Information File) data and natural-language descriptions of structures. Our results rival state-of-the-art composition-based models such as CrabNet and, on several tasks, surpass traditional structure-based models such as CGCNN, DeeperGATGNN, MEGNet, and DimeNet++, as well as knowledge-driven models such as MODNet. Notably, despite its 110 million parameters, our approach excels on small datasets with minimal overfitting, indicating that fine-tuned language models can genuinely capture meaningful materials insights. Our findings provide a new reference point for future LLM applications in materials design, offering practical guidance for leveraging language models in this domain and suggesting a paradigm shift for physicists and materials scientists: emphasizing natural-language descriptions over conventional model-centric design. We term this application of BERT to materials design AlchemBERT, signifying its role in bridging natural language and structural representations.
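To make the described workflow concrete, the following is a minimal sketch, not the authors' released code, of fine-tuning BERT-base as a property regressor on textual material descriptions, assuming the Hugging Face transformers API. The model checkpoint, toy descriptions, target values, and hyperparameters are illustrative assumptions, not details taken from the paper.

import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification

# BERT-base (~110M parameters) with a single-output regression head.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

# Toy inputs: natural-language structure descriptions (or raw CIF text)
# paired with hypothetical target properties (placeholder values).
texts = [
    "SiO2 crystallizes in the tetragonal P4_2/mnm space group.",
    "NaCl crystallizes in the cubic Fm-3m space group.",
]
targets = torch.tensor([[-5.98], [-3.52]])

batch = tokenizer(
    texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few illustrative fine-tuning steps
    outputs = model(**batch, labels=targets)  # MSE loss for regression heads
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits  # predicted property values
print(preds)

The same loop applies whether the input text is a composition string, the contents of a CIF file, or a natural-language structure summary; only the tokenized input changes.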

Keywords

materials property prediction
machine learning
language models
BERT
