CGsmiles: A Versatile Line Notation for Molecular Representations Across Multiple Resolutions

21 February 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Coarse-grained (CG) models simplify molecular representations by grouping multiple atoms into effective particles, enabling faster simulations and reducing the chemical compound space compared to atomistic methods. Additionally, models with chemical specificity, such as Martini, may extrapolate to cases where experimental data are scarce, making CG methods highly promising for high-throughput (HT) screening and chemical space exploration. Yet no rigorous data format exists for the crucial aspect of describing how the atoms are grouped (i.e., the mapping). As CG models advance toward true HT capabilities, the lack of mapping and indexing capabilities for the growing number of CG molecules poses a significant barrier. To address this, we introduce CGsmiles, a versatile line notation inspired by the popular Simplified Molecular Input Line Entry System (SMILES) and BigSMILES. CGsmiles encodes the molecular graph and particle (atom) properties independent of their resolution and incorporates a framework that allows seamless conversion between coarse- and fine-grained resolutions. By specifying fragments that describe how each particle is represented at the next finer resolution (e.g., CG particles to atoms), CGsmiles can represent multiple resolutions and their hierarchical relationships in a single string. In this paper, we present the CGsmiles syntax and analyze a benchmark set of 407 molecules from the Martini force field. We highlight key features that are missing in existing notations but essential for accurately describing CG models. To demonstrate the utility of CGsmiles beyond simulations, we construct two simple machine-learning models for predicting partition coefficients, both trained on CGsmiles-indexed data and leveraging information from both CG and atomistic resolutions. Finally, we briefly discuss the applicability of CGsmiles to polymers, which particularly benefit from the multiresolution nature of the notation.
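
The idea of describing each particle by a fragment at the next finer resolution can be illustrated with a small Python sketch. This is only a toy data structure under stated assumptions: the bead names (`B1`, `B2`), the fragment contents, and the helper functions `expand` and `build_mapping` are hypothetical, and the code reproduces neither the CGsmiles syntax nor the API of any CGsmiles software. It shows how a coarse graph plus a fragment table is enough to recover both the finer resolution and the mapping between the two.

```python
# Toy illustration only: bead names and fragment contents are hypothetical,
# and this is neither the CGsmiles syntax nor any CGsmiles package API.

# Coarse resolution: a small graph of particles (beads) and their bonds.
cg_beads = ["B1", "B2", "B1"]
cg_bonds = [(0, 1), (1, 2)]  # kept for completeness; bonds are not expanded here

# Fragments: how each bead type is represented at the next finer resolution.
fragments = {
    "B1": ["C", "C", "O"],
    "B2": ["C", "O"],
}


def expand(beads, fragments):
    """Expand beads into finer-resolution atoms, recording each atom's parent bead."""
    atoms = []
    for bead_index, bead_name in enumerate(beads):
        for element in fragments[bead_name]:
            atoms.append((element, bead_index))
    return atoms


def build_mapping(atoms):
    """Group fine-resolution atom indices by the coarse bead they belong to."""
    mapping = {}
    for atom_index, (_, bead_index) in enumerate(atoms):
        mapping.setdefault(bead_index, []).append(atom_index)
    return mapping


atoms = expand(cg_beads, fragments)
mapping = build_mapping(atoms)
print(mapping)  # {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7]}
```

Because every fine-resolution atom keeps a reference to its parent bead, the same record serves both resolutions, which is the property the abstract attributes to a single multiresolution string.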

Keywords

CGsmiles
Martini
coarse-graining
SMILES

Supplementary materials

Title: Article Supporting Information
Description: Common mapping file formats; assignment of cis/trans isomers and mapping of sterols in Martini 3
