ChemRxiv

Transformer-CNN: Fast and Reliable Tool for QSAR

preprint
submitted on 09.10.2019 and posted on 21.10.2019 by Pavel Karpov, Guillaume Godin, Igor Tetko
We present SMILES embeddings derived from the internal encoder state of a Transformer model trained to canonicalize SMILES as a Seq2Seq problem. Applying a CharNN architecture to these embeddings yields higher-quality QSAR/QSPR models on diverse benchmark datasets, including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for both training and inference, so each prediction is based on an internal consensus. Together, the augmentation and the embedding-based transfer learning allow the method to provide good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings are available at https://github.com/bigchem/transformer-cnn, and the OCHEM environment (https://ochem.eu) hosts its online implementation.
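The idea of an internal consensus at inference time can be illustrated with a minimal sketch. It is not taken from the authors' repository: the names `consensus_predict` and `toy_model` are hypothetical, the predictor is a deliberately crude stand-in for a trained model, and the SMILES variants are written by hand rather than generated by a cheminformatics toolkit. The sketch assumes the consensus is a plain mean over the predictions for several SMILES strings of the same molecule.

```python
from statistics import mean
from typing import Callable, Sequence


def consensus_predict(
    smiles_variants: Sequence[str],
    predict_one: Callable[[str], float],
) -> float:
    """Average a model's predictions over several SMILES strings of one molecule.

    Assumption: the consensus is a simple arithmetic mean.
    """
    if not smiles_variants:
        raise ValueError("need at least one SMILES variant")
    return mean(predict_one(s) for s in smiles_variants)


# Three valid SMILES spellings of the same molecule (ethanol), written by hand.
ethanol_variants = ["CCO", "OCC", "C(O)C"]


def toy_model(smiles: str) -> float:
    """Hypothetical stand-in for a trained model.

    Its output depends on the position of the oxygen atom in the string,
    so different spellings of one molecule give different raw predictions.
    """
    return float(smiles.index("O"))


# Individual predictions differ per spelling; the consensus is their mean.
per_variant = [toy_model(s) for s in ethanol_variants]
consensus = consensus_predict(ethanol_variants, toy_model)
```

Because the stand-in model is sensitive to the way the string is written, each spelling yields its own value, and the averaged result depends less on any single spelling than an individual prediction does.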

Email Address of Submitting Author

pavel.karpov@helmholtz-muenchen.de

Institution

Helmholtz Zentrum Muenchen, STB

Country

Germany

ORCID For Submitting Author

0000-0003-4786-9806

Declaration of Conflict of Interest

No conflict of interest.
