Translating the Molecules: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier

08 March 2021, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine's online PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set accuracies of 95% (character-level) and 91% (whole name). The model performed particularly well on organic compounds, with the exception of macrocycles. Predictions were less accurate for inorganic compounds, with a character-level accuracy of 71%. This can be explained by inherent limitations of InChI in representing inorganics, as well as their low coverage (1.4%) in the training data.
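To illustrate the character-level encoder-decoder setup described in the abstract, the sketch below builds character vocabularies from InChI/IUPAC name pairs and feeds them through a transformer encoder-decoder. It is a minimal illustration assuming PyTorch's nn.Transformer, not the authors' implementation; the example pairs, hyperparameters, and helper names are illustrative, and positional encodings are omitted for brevity.

```python
# Minimal sketch (not the authors' code): character-level tokenization of
# InChI inputs and IUPAC-name outputs, fed to a transformer encoder-decoder.
import torch
import torch.nn as nn

# Tiny illustrative InChI / IUPAC name pairs (real training used ~10M pairs).
pairs = [
    ("InChI=1S/CH4/h1H4", "methane"),
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethanol"),
]
SRC_CHARS = sorted({c for inchi, _ in pairs for c in inchi})
TGT_CHARS = ["<sos>", "<eos>"] + sorted({c for _, name in pairs for c in name})
src_stoi = {c: i for i, c in enumerate(SRC_CHARS)}
tgt_stoi = {c: i for i, c in enumerate(TGT_CHARS)}

def encode_src(inchi):
    # One character = one token for the InChI input.
    return torch.tensor([[src_stoi[c] for c in inchi]])            # (1, src_len)

def encode_tgt(name):
    # Character tokens for the IUPAC name, wrapped in start/end markers.
    ids = [tgt_stoi["<sos>"]] + [tgt_stoi[c] for c in name] + [tgt_stoi["<eos>"]]
    return torch.tensor([ids])                                      # (1, tgt_len)

class Char2CharTransformer(nn.Module):
    def __init__(self, n_src, n_tgt, d_model=256, nhead=8, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(n_src, d_model)
        self.tgt_emb = nn.Embedding(n_tgt, d_model)
        # Encoder stack reads the InChI; decoder stack emits the IUPAC name.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, n_tgt)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each output character attends only to earlier ones.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)                                          # (1, tgt_len, n_tgt)

model = Char2CharTransformer(len(SRC_CHARS), len(TGT_CHARS))
src, tgt = encode_src(pairs[1][0]), encode_tgt(pairs[1][1])
logits = model(src, tgt[:, :-1])                                    # teacher forcing
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                    tgt[:, 1:].reshape(-1))
print(logits.shape, float(loss))
```

At inference time, the decoder would instead generate the name one character at a time, feeding each predicted character back in until the end-of-sequence marker is produced.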

Keywords

InChI
IUPAC name
Transformer
attention
GPU
prediction
