Abstract
This study presents a comprehensive evaluation of two reasoning-enhanced large language models (LLMs), DeepSeek R1 and OpenAI o4-mini, across six critical chemical tasks. By benchmarking these models against conventional LLMs and established computational tools, we systematically investigate the impact of reasoning capabilities on chemical cognition. Experimental results demonstrate that reasoning-enabled LLMs achieve significant performance improvements in foundational tasks: DeepSeek R1 attains 88.88% accuracy in SMILES-to-name conversion and 58% accuracy in point group identification, outperforming both OpenAI o4-mini (81.48% and 26%, respectively) and legacy models. However, domain-specific limitations persist: both models produce structural inaccuracies in CIF file generation (e.g., erroneous atomic connectivity) and struggle with ordered pattern synthesis in SEM simulations. Notably, while reasoning frameworks enhance logical coherence, they do not inherently resolve challenges in stereochemical assignment or rare symmetry group recognition. These findings underscore the need for domain-optimized training paradigms to bridge the gap between generic reasoning capabilities and specialized chemical applications.
Supplementary materials
S1. Detailed output results of SMILES code and chemical name conversion process
S2. Detailed output results of logP
S3. Prompts and corresponding raw images utilized during the image generation procedure
Crystal structure files generated by the large language models
******IMPORTANT DISCLAIMER******
All Crystallographic Information Files (.cif) within this directory are algorithmically generated by Large Language Models (LLMs) and do not represent experimental data.
Complete details are provided in the WARNING.txt file within the compressed archive.