These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.

Transformer-CNN: Fast and Reliable Tool for QSAR

submitted on 09.10.2019, 19:19 and posted on 21.10.2019, 16:27 by Pavel Karpov, Guillaume Godin, Igor Tetko
We present SMILES embeddings derived from the internal encoder state of a Transformer model trained to canonicalize SMILES as a Seq2Seq problem. Applying a CharNN architecture to these embeddings yields higher-quality QSAR/QSPR models on diverse benchmark datasets, including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, so each prediction is based on an internal consensus. Both the augmentation and the embedding-based transfer learning allow the method to provide good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings are publicly available, and the OCHEM environment hosts an online implementation.
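The consensus over augmented SMILES mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the final prediction is a plain average over several equivalent SMILES strings of the same molecule; the abstract does not state how the consensus is actually formed. The names `consensus_predict` and `toy_model` are hypothetical, the three ethanol strings are written by hand rather than generated, and the toy model is only a stand-in for a trained network.

```python
from statistics import mean
from typing import Callable, Sequence


def consensus_predict(
    smiles_variants: Sequence[str],
    predict_one: Callable[[str], float],
) -> float:
    """Average one model's predictions over equivalent SMILES of a molecule.

    Assumption: the consensus is an arithmetic mean; the source only says
    that the prediction rests on an internal consensus.
    """
    if not smiles_variants:
        raise ValueError("need at least one SMILES string")
    return mean(predict_one(s) for s in smiles_variants)


# Three hand-written, equivalent SMILES strings for ethanol.
# In practice an augmentation step would generate such variants.
ethanol_variants = ["CCO", "OCC", "C(O)C"]


def toy_model(smiles: str) -> float:
    # Hypothetical stand-in for a trained model. Its output depends on how
    # the string is written, which is the variance that averaging smooths.
    return float(smiles.index("O"))


prediction = consensus_predict(ethanol_variants, toy_model)
```

Because each variant encodes the same molecule, averaging reduces the dependence of the result on any single way of writing the structure.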


Helmholtz Zentrum Muenchen, STB



Declaration of Conflict of Interest

No conflict of interest.