These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or reported in news media as established information.

Translating the Molecules: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier

submitted on 05.03.2021, 14:13 and posted on 08.03.2021, 07:19 by Jennifer Handsel, Brian Matthews, Nicola Knight, Simon Coles
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set accuracies of 95% (character-level) and 91% (whole name). The model performed particularly well on organics, with the exception of macrocycles. The predictions were less accurate for inorganic compounds, with a character-level accuracy of 71%. This can be explained by inherent limitations in InChI for representing inorganics, as well as the low coverage of inorganics (1.4%) in the training data.
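To make the character-by-character framing and the two reported metrics concrete, the following is a minimal sketch rather than the authors' implementation. The special tokens, the helper names, and in particular the position-wise definition of character-level accuracy are assumptions for illustration; the abstract does not specify how that metric was computed.

```python
from typing import Dict, List, Sequence

# Hypothetical special tokens; the abstract does not name the ones actually used.
PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"


def build_vocab(texts: Sequence[str]) -> Dict[str, int]:
    """Map every character seen in `texts` to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: i for i, tok in enumerate([PAD, BOS, EOS] + chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Character-level encoding: one id per character, wrapped in BOS/EOS."""
    return [vocab[BOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids: Sequence[int], vocab: Dict[str, int]) -> str:
    """Invert `encode`, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, BOS, EOS))


def character_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of character positions that match (assumed definition).

    Positions are compared left to right; the longer string of each pair
    sets the number of positions, so length mismatches count as errors.
    """
    correct = total = 0
    for pred, ref in zip(predicted, reference):
        total += max(len(pred), len(ref))
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total if total else 0.0


def whole_name_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of names that are reproduced exactly."""
    pairs = list(zip(predicted, reference))
    return sum(p == r for p, r in pairs) / len(pairs) if pairs else 0.0


# Two InChI / IUPAC name pairs (ethanol and methane) as toy data.
inchis = ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "InChI=1S/CH4/h1H4"]
names = ["ethanol", "methane"]

source_vocab = build_vocab(inchis)  # characters of the input "language"
target_vocab = build_vocab(names)   # characters of the output "language"
source_ids = encode(inchis[0], source_vocab)
target_ids = encode(names[0], target_vocab)
```

Keeping separate source and target vocabularies mirrors the translation framing, where the InChI and the IUPAC name are treated as two different languages. A whole-name score is stricter than a character-level one because a single wrong character makes the entire name count as incorrect, which is consistent with the 91% versus 95% figures reported above.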


An EPSRC National Research Facility to facilitate Data Science in the Physical Sciences: The Physical Sciences Data science Service (PSDS)

Engineering and Physical Sciences Research Council





Science and Technology Facilities Council


United Kingdom



Declaration of Conflict of Interest