ChemRxiv

Struct2IUPAC -- Transformer-Based Artificial Neural Network for the Conversion Between Chemical Notations

preprint
revised on 11.01.2021, 07:36 and posted on 12.01.2021, 06:48 by Lev Krasnov, Ivan Khokhlov, Maxim Fedorov, Sergey Sosnin
Providing IUPAC chemical names is necessary for chemical information exchange. We developed a Transformer-based artificial neural architecture to translate between SMILES and IUPAC chemical notations, with two models: Struct2IUPAC and IUPAC2Struct. Our models demonstrate performance comparable to that of rule-based solutions. We show that the accuracy, computation speed, and robustness of the model make it suitable for production use. Our showcase demonstrates that a neural-based solution can enable rapid development while maintaining the same performance. We believe that our findings will inspire other developers to reduce development costs by replacing complex rule-based solutions with neural-based ones. A demonstration of the Struct2IUPAC model is available online on the Syntelly platform: https://app.syntelly.com/smiles2iupac

Email Address of Submitting Author

sergey.sosnin@skoltech.ru

Institution

Skolkovo Institute of Science and Technology

Country

Russia

ORCID For Submitting Author

0000-0002-3042-7369

Declaration of Conflict of Interest

Maxim Fedorov and Sergey Sosnin are co-founders of Syntelly LLC. Lev Krasnov and Ivan Khokhlov are employees of Syntelly LLC.

Version Notes

Version 2.0. In this version, we have made corrections and improvements. We added Table 1, which reports the models' accuracy for various beam sizes. We added the distribution of the number of name variations generated by the Transformer (Figure 8). We fixed a bug that led to a non-uniform distribution of the 100 000 molecules selected from our test set, and we prepared a new 100k subset with a uniform distribution. On this new subset, we recalculated the performance of the direct and reverse models (Table 1) and the dependence of model accuracy on SMILES length (Figure 4). We also redrew Figure 5 to follow the distribution of the new 100k dataset. Performance on the new test set remains very high (although not perfect) and comparable to that of algorithm-based solutions.
