Theoretical and Computational Chemistry

Struct2IUPAC -- Transformer-Based Artificial Neural Network for the Conversion Between Chemical Notations

Authors

Abstract

Providing IUPAC chemical names is necessary for chemical information exchange. We developed a Transformer-based artificial neural architecture to translate between SMILES and IUPAC chemical notations, yielding two models: Struct2IUPAC and IUPAC2Struct. The models perform comparably to rule-based solutions. We show that the accuracy, computation speed, and robustness of the models are sufficient for production use. Our showcase demonstrates that a neural-based solution can speed up development while maintaining the same performance. We believe that our findings will inspire other developers to reduce development costs by replacing complex rule-based solutions with neural-based ones. A demonstration of the Struct2IUPAC model is available online on the Syntelly platform: https://app.syntelly.com/smiles2iupac

Version notes

Version 2.0. In this version, we made corrections and improvements. We added Table 1, which reports the models' accuracy for various beam sizes. We added the distribution of the number of name variations generated by the Transformer (Figure 8). We fixed a bug that led to a non-uniform distribution of the 100 000 molecules selected from our test set, and we prepared a new 100k subset with a uniform distribution. On this new subset, we recalculated the performance of the direct and reverse models (Table 1) and the dependence of model accuracy on SMILES length (Figure 4). We also redrew Figure 5 to follow the distribution of the new 100k dataset. The performance on the new test set remains very high (although not perfect) and comparable to that of algorithm-based solutions.
