AlchemBERT: Exploring Lightweight Language Models for Materials Informatics

13 February 2025, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The emergence of large language models (LLMs) has spurred numerous applications across domains, including material design. In this field, a growing number of generative models aim to directly generate materials with desired properties, moving away from the traditional approach of enumerating vast numbers of candidates and screening them with computationally intensive algorithms. However, we argue that effective material design is unattainable without accurate prediction capabilities: generating candidate structures is futile if their quality cannot be reliably evaluated. Matbench provides an excellent foundation for predictive tasks, yet prior LLM efforts have focused primarily on composition-based tasks using models such as GPT or LLaMA. In this study, we revisit BERT, a relatively small language model with 110 million parameters, significantly smaller than GPT or LLaMA models containing billions of parameters. Remarkably, we demonstrate that BERT-base matches these larger models on material property prediction tasks. Extending beyond composition tasks, we apply BERT to structure-based prediction using CIF (Crystallographic Information File) data and natural-language descriptions of structures. Our results rival state-of-the-art composition models such as CrabNet and, on several tasks across datasets ranging from a few hundred to over a hundred thousand samples, even surpass traditional structure-based and knowledge-driven models. Additionally, on the latest Matbench test task, Matbench-Discovery, our model outperforms the Voronoi RF baseline and achieves MAE results comparable to other models that rely solely on energy predictions. Our findings provide a new reference point for future LLM applications in material design, offering insights into leveraging language models in this domain and emphasizing natural-language descriptions over conventional model-centric design. We term this application of BERT in material design AlchemBERT, signifying its role in bridging natural language and structural representations.
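
To make the approach described above concrete, the following is a minimal sketch (not the authors' exact pipeline) of fine-tuning BERT-base as a single-output regressor on a textual representation of a material, using the standard Hugging Face transformers recipe. The description string and target value are hypothetical placeholders; a serialized CIF string could be fed in the same way.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "bert-base-uncased"  # ~110M parameters, the scale cited in the abstract

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels=1 attaches a single-output regression head to the pooled [CLS]
    # representation; problem_type="regression" selects an MSE loss internally.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=1, problem_type="regression"
    )

    # Hypothetical training pair: a natural-language structure description and
    # an illustrative target property value (e.g., formation energy in eV/atom).
    text = ("SrTiO3 crystallizes in the cubic Pm-3m space group. "
            "Sr(2+) is bonded to twelve equivalent O(2-) atoms ...")
    target = torch.tensor([[-3.55]])  # placeholder value, not a real label

    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**inputs, labels=target)
    outputs.loss.backward()  # one fine-tuning gradient step (optimizer loop omitted)
    print(float(outputs.loss), float(outputs.logits))

One practical consequence of this framing is BERT's 512-token context window: long CIF files or verbose structure descriptions must be truncated or otherwise condensed before tokenization.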

Keywords

materials property prediction
machine learning
language models
BERT

Supplementary materials

Title: Supplementary Materials
Description: Different subsets in the Matbench-Discovery test
