Abstract
Determining complete atomic structures directly from microscopy images remains a longstanding challenge in materials science. MicroscopyGPT is a vision-language model (VLM) that leverages multimodal generative pre-trained transformers to predict full atomic configurations, including lattice parameters, element types, and atomic coordinates, from scanning transmission electron microscopy (STEM) images. The model is trained on a chemically and structurally diverse dataset of simulated STEM images generated with the AtomVision tool from the JARVIS-DFT and C2DB two-dimensional (2D) materials databases. The fine-tuning set comprises approximately 5,000 2D materials, enabling the model to learn the complex mapping from image features to crystallographic representations. I fine-tune the 11-billion-parameter LLaMA model, which allows efficient training on resource-constrained hardware. The rapid rise of VLMs, together with the growth of materials datasets, offers a major opportunity for microscopy-based analysis. This work highlights the potential of automated structure reconstruction from microscopy images, with broad implications for materials discovery, nanotechnology, and catalysis.
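To make the described pipeline concrete, the minimal sketch below shows one plausible way to fine-tune an 11B multimodal LLaMA on (STEM image, structure text) pairs with LoRA adapters. The checkpoint name, the prompt wording, the `make_batch` helper, and the choice of LoRA as the efficiency mechanism are all assumptions for illustration; the abstract only states that an 11-billion-parameter LLaMA model is fine-tuned, not how.

```python
# Minimal sketch: LoRA fine-tuning of a multimodal LLaMA on STEM-image -> structure-text pairs.
# The checkpoint below is an assumption (the abstract says only "11-billion-parameter LLaMA"),
# and the dataset of (image, CIF-like text) pairs is assumed to be prepared elsewhere.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA freezes the 11B base weights and trains small adapter matrices,
# one way "efficient training on resource-constrained hardware" could be realized.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

def make_batch(image_path: str, structure_text: str):
    """Pack one (STEM image, target structure description) pair into model inputs."""
    image = Image.open(image_path).convert("RGB")
    prompt = "<|image|>Predict the lattice parameters, elements, and atomic coordinates."
    inputs = processor(images=image, text=prompt + structure_text,
                       return_tensors="pt").to(model.device)
    inputs["labels"] = inputs["input_ids"].clone()  # standard causal-LM supervision
    return inputs

# One illustrative optimization step; a real run would loop over the ~5,000-material dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = make_batch("stem_example.png",
                   "a=3.19 A, b=3.19 A; Mo (0, 0, 0); S (1/3, 2/3, 0.25)")
loss = model(**batch).loss
loss.backward()
optimizer.step()
```

Framing the output as plain text (lattice parameters, elements, fractional coordinates) is what lets a generative language model handle structure prediction as conditional text generation rather than as a bespoke regression task.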