These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or reported in news media as established information.

Struct2IUPAC -- Transformer-Based Artificial Neural Network for the Conversion Between Chemical Notations

revised on 11.01.2021, 07:36 and posted on 12.01.2021, 06:48 by Lev Krasnov, Ivan Khokhlov, Maxim Fedorov, Sergey Sosnin
Providing IUPAC chemical names is necessary for chemical information exchange. We developed a Transformer-based artificial neural architecture to translate between SMILES and IUPAC chemical notations: Struct2IUPAC and IUPAC2Struct. Our models demonstrate performance comparable to that of rule-based solutions. We show that the accuracy, computation speed, and robustness of the model make it suitable for use in production. Our showcase demonstrates that a neural-based solution can enable rapid development while maintaining the same performance. We believe that our findings will inspire other developers to reduce development costs by replacing complex rule-based solutions with neural-based ones. A demonstration of the Struct2IUPAC model is available online on the Syntelly platform.


Skolkovo Institute of Science and Technology



Declaration of Conflict of Interest

Maxim Fedorov and Sergey Sosnin are co-founders of Syntelly LLC. Lev Krasnov and Ivan Khokhlov are employees of Syntelly LLC.

Version Notes

Version 2.0: In this version, we made corrections and improvements. We added Table 1, which describes the models' accuracy for various beam sizes. We added the distribution of the number of name variations generated by the Transformer (Figure 8). We fixed a bug that caused the 100,000 molecules selected from our test set to be non-uniformly distributed, and we prepared a new 100k subset with a uniform distribution. On this new 100k subset, we recalculated the performance of the direct and reverse models (Table 1) and the dependence of model accuracy on SMILES length (Figure 4). We also redrew Figure 5 to follow the distribution of the new 100k dataset. The performance on the new test set remains very high (although not perfect) and comparable to that of algorithm-based solutions.
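
The resampling and the accuracy-versus-length analysis described in these notes could be sketched roughly as follows. This is only an illustration, not the authors' code: it assumes that "uniform" means every test-set molecule is equally likely to be selected, and the function names uniform_subset and accuracy_by_length, the fixed seed, and the bin width of 10 characters are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def uniform_subset(items, k, seed=0):
    """Draw k items uniformly at random, without replacement.

    Hypothetical helper: the seed only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(items, k)


def accuracy_by_length(pairs, bin_width=10):
    """Group (smiles, is_correct) pairs by SMILES length and
    return {bin_start: fraction of correct predictions}.

    The bin width is an assumption chosen for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for smiles, is_correct in pairs:
        start = (len(smiles) // bin_width) * bin_width
        totals[start] += 1
        hits[start] += int(is_correct)
    return {start: hits[start] / totals[start] for start in sorted(totals)}


if __name__ == "__main__":
    # Toy data only; these correctness flags are not results from the paper.
    toy = [("CCO", True), ("CCN", False), ("c1ccccc1CCCC", True)]
    print(accuracy_by_length(toy))
    print(len(uniform_subset(list(range(1000)), 100)))
```

Sampling without replacement from the full test set keeps every molecule equally likely to appear in the subset, which is the property the bug fix is described as restoring.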