Abstract
Machine learning (ML) has become a powerful tool in polymer science, and its success relies heavily on effective structural representations of polymers. While the Simplified Molecular Input Line Entry System (SMILES) is widely used for its simplicity, it was originally designed for small molecules and struggles to capture the stochastic nature of polymers. BigSMILES has recently been introduced as a more compact and versatile representation of polymer structures; however, the relative performance of SMILES and BigSMILES in polymer ML tasks remains unexplored.
In this study, we systematically evaluate SMILES and BigSMILES across 12 polymer-related tasks, including property prediction and inverse design, using convolutional neural networks (CNNs) and large language models (LLMs). Our results show that BigSMILES enables faster training owing to its reduced token complexity and achieves comparable or superior performance to SMILES on certain predictive tasks. Moreover, within LLM frameworks, BigSMILES more accurately encodes chemical information and monomer connectivity for copolymers. This work serves as a starting point for a comprehensive evaluation of SMILES and BigSMILES in polymer ML applications, highlighting the potential of BigSMILES to streamline and accelerate polymer informatics workflows, particularly for complex systems such as copolymers and polymer composites. Looking ahead, advancing polymer representations to integrate chain structure, phase morphology, and processing parameters will be crucial for capturing the multifaceted relationships between polymer structure and properties, enabling more accurate and efficient modeling.