Standardizing chemical compounds with language models


With the growing amount of chemical data stored digitally, it has become crucial to represent chemical compounds accurately and consistently. Harmonized representations facilitate the extraction of insightful information from datasets, and are advantageous for machine learning applications. To achieve consistent representations throughout datasets, one relies on molecule standardization, which is typically accomplished using rule-based algorithms that modify descriptions of functional groups. Here, we present the first deep-learning model for molecular standardization. We enable custom standardization schemes based solely on data, which, as additional benefit, support standardization options that are difficult to encode into rules. Our model achieves over 98% accuracy in learning two popular rule-based standardization protocols. We then follow a transfer learning approach to standardize metal-organic compounds (for which there is currently no automated standardization practice), based on a human-curated dataset of 1512 compounds. This model predicts the expected standardized molecular format with a test accuracy of 75.6%. As standardization can be considered, more broadly, a transformation from undesired to desired representations of compounds, the same data-driven architecture can be applied to other tasks. For instance, we demonstrate the application to compound canonicalization and to the determination of major tautomers in solution, based on computed and experimental data.

Version notes

We report additional analyses and results for the standardization of metal-organic compounds and extend the standardization scheme to major tautomer prediction.


Supplementary weblinks

Code and data
Python package for the standardization pipeline, and links to the relevant data used for the manuscript.