Standardizing chemical compounds with language models

Miruna T. Cretu; Alessandra Toniato; Amol Thakkar; Amin Debabeche; Teodoro Laino; Alain C. Vaucher

doi:10.26434/chemrxiv-2022-14ztf-v2

Theoretical and Computational Chemistry

10 March 2023, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

With the growing amount of chemical data stored digitally, it has become crucial to represent chemical compounds accurately and consistently. Harmonized representations facilitate the extraction of insightful information from datasets and are advantageous for machine learning applications. To achieve consistent representations throughout datasets, one relies on molecule standardization, which is typically accomplished using rule-based algorithms that modify descriptions of functional groups. Here, we present the first deep-learning model for molecular standardization. We enable custom standardization schemes based solely on data, which, as an additional benefit, support standardization options that are difficult to encode into rules. Our model achieves over 98% accuracy in learning two popular rule-based standardization protocols. We then follow a transfer learning approach to standardize metal-organic compounds (for which there is currently no automated standardization practice), based on a human-curated dataset of 1512 compounds. This model predicts the expected standardized molecular format with a test accuracy of 75.6%. As standardization can be considered, more broadly, a transformation from undesired to desired representations of compounds, the same data-driven architecture can be applied to other tasks. For instance, we demonstrate its application to compound canonicalization and to the determination of major tautomers in solution, based on computed and experimental data.
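To make the "translation from undesired to desired representations" framing concrete, the following is a minimal sketch of how one source/target pair might be prepared for a sequence model and how predictions might be scored. It rests on assumptions that the abstract does not state: that compounds are written as SMILES strings, that they are split with a regular expression of the kind commonly used for SMILES-based sequence models, and that accuracy means exact string match. The names `tokenize_smiles` and `exact_match_accuracy` are hypothetical, and the nitro-group pair is an illustrative example, not taken from the authors' data.

```python
import re
from typing import List, Sequence

# Assumption: a regex of the kind commonly used to split SMILES into tokens
# (bracket atoms, two-letter halogens, ring closures, bonds, branches).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; fail loudly if characters are lost."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES without loss: {smiles!r}")
    return tokens


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that are string-identical to their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return matches / len(targets)


# Illustrative pair: a nitro group written in a pentavalent form (source)
# and in the charge-separated form (target).
source_smiles = "CN(=O)=O"
target_smiles = "C[N+](=O)[O-]"

# Space-separated tokens are a typical input format for sequence-to-sequence models.
source_line = " ".join(tokenize_smiles(source_smiles))
target_line = " ".join(tokenize_smiles(target_smiles))
```

Exact string match is a strict criterion: a prediction that encodes the right molecule but writes it differently counts as wrong, which is one reason a scoring scheme like this would usually be paired with a fixed output convention.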

Keywords

Deep learning

Molecule standardization

Natural language processing

Chemoinformatics

Supplementary weblinks

Code and data

Python package for the standardization pipeline, and links to the relevant data used for the manuscript.

Now Published

Standardizing chemical compounds with language models

Miruna T Cretu, Alessandra Toniato, Amol Thakkar, Amin A Debabeche, Teodoro Laino, Alain C Vaucher

Journal article: Machine Learning: Science and Technology, Volume 4, Issue 3

Online publication date: Aug 08, 2023

Version History

Mar 10, 2023 Version 2

Nov 16, 2022 Version 1

Version Notes

We report additional analyses and results for the standardization of metal-organic compounds and extend the standardization scheme to major tautomer prediction.

Metrics

Views: 2,550

Downloads: 1,127

License

The content is available under CC BY-NC-ND 4.0

Funding

Swiss National Science Foundation

NCCR Catalysis (grant number 180544)

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
