Abstract
We present a comprehensive and reproducible pipeline that unites literature mining, molecular graph generation, and uncertainty-aware predictive modeling to accelerate the design of organic spacer cations for 2D halide perovskites (HPs). Despite the critical influence of spacer chemistry on phase stability, excitonic behavior, transport properties, and environmental robustness, the design space of HPs remains underexplored due to inconsistent reporting and limited structured datasets. To overcome this, we curated a diverse set of 200 experimental papers from various publishers and research groups and loaded them into Google’s NotebookLM, powered by Gemini, using its retrieval-augmented generation (RAG) framework to extract synthesis-relevant metadata with high accuracy and reproducibility. To ensure data quality and consistency, we limited our selection to papers published in peer-reviewed journals with an impact factor above 10, focusing on studies with well-documented experimental protocols. Benchmarking against four other LLMs confirmed NotebookLM’s superior stability and minimal hallucination rate, making it well suited to hypothesis-driven data curation. From the extracted IUPAC names, we constructed SMILES representations and augmented the dataset with over 10,000 ammonium-containing molecules from QM9. These were converted into graph-based molecular embeddings and used to train a multitask graph neural network coupled with a Gaussian process (GNN–GP) backend to predict optoelectronic and structural properties with uncertainty quantification. Latent-space clustering of the learned embeddings revealed chemically interpretable families of spacer candidates, which we cross-validated against ChatGPT-generated design heuristics. The convergence between unsupervised clustering and transformer-derived guidance highlights the power of combining LLMs with active learning to generate, test, and refine design hypotheses in underexplored chemical domains.
This study demonstrates how fragmented literature can be transformed into actionable structure–property insights through a tightly integrated informatics pipeline. Our approach lays the foundation for closed-loop, autonomous materials discovery and design, and provides a scalable strategy for the targeted development of next-generation HP optoelectronics.
Supplementary materials
Title
POLARIS: Perovskite Optimization using LLM-Assisted Refinement and Intelligent Screening
Description
Supplementary material for LLM benchmarking, DFT calculations, and other tables