Abstract
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a
chemical from its standard International Chemical Identifier (InChI). The model uses two stacks
of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in
state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes
input and output into words or sub-words, our model processes the InChI and predicts the
IUPAC name character by character. The model was trained on a dataset of 10 million
InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online
PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set
accuracies of 95% (character-level) and 91% (whole name). The model performed particularly
well on organics, with the exception of macrocycles. The predictions were less accurate for
inorganic compounds, with a character-level accuracy of 71%. This can be explained by inherent
limitations of InChI in representing inorganics, as well as by their low share (1.4%) of the training
data.
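
To illustrate the difference between the two reported accuracy figures, the following minimal sketch shows character-by-character tokenization and one plausible way to compute character-level and whole-name accuracy. The function names and the exact definition of character-level accuracy (position-wise matches divided by the longer string's length) are assumptions made for illustration; they are not taken from the model's actual implementation.

```python
from typing import List, Sequence


def char_tokenize(text: str) -> List[str]:
    """Split a string (InChI or IUPAC name) into single-character tokens."""
    return list(text)


def char_accuracy(pred: str, target: str) -> float:
    """Fraction of positions where predicted and target characters agree.

    Assumption: mismatched lengths are penalized by dividing by the
    longer of the two strings.
    """
    if not pred and not target:
        return 1.0
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target))


def whole_name_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match the target name exactly."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not targets:
        raise ValueError("targets must not be empty")
    exact = sum(p == t for p, t in zip(preds, targets))
    return exact / len(targets)


if __name__ == "__main__":
    # Hypothetical prediction/target pairs, for illustration only.
    print(char_tokenize("InChI=1S/CH4/h1H4"))
    print(char_accuracy("methane", "methanol"))
    print(whole_name_accuracy(["methane", "ethanol"], ["methane", "ethanal"]))
```

Under this definition a single wrong character lowers character-level accuracy only slightly but counts the whole name as incorrect, which is consistent with whole-name accuracy being the lower of the two figures.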