Submitted on 09.10.2019 and posted on 21.10.2019 by Pavel Karpov, Guillaume Godin, Igor Tetko
We present SMILES embeddings derived from the internal encoder states of a Transformer model trained to canonicalize SMILES as a Seq2Seq problem. Using a CharNN architecture on top of these embeddings yields higher-quality QSAR/QSPR models on diverse benchmark datasets, covering both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, so the final prediction is grounded in an internal consensus. Both the augmentation and the embedding-based transfer learning allow the method to deliver good results on small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings are available at https://github.com/bigchem/transformer-cnn, while the OCHEM environment (https://ochem.eu) hosts its online implementation.
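To make the augmentation-and-consensus idea concrete, the sketch below enumerates randomized (non-canonical) SMILES for a molecule and averages model predictions over them. It assumes RDKit for SMILES parsing; the `model.predict` call is a hypothetical placeholder for any trained QSAR model and is not taken from the paper's code.

    # Minimal sketch: SMILES augmentation with consensus prediction.
    # Assumes RDKit; `model` is a hypothetical object with a predict(smiles) method.
    from rdkit import Chem

    def augment_smiles(smiles: str, n: int = 10) -> list[str]:
        """Enumerate up to n randomized (non-canonical) SMILES of one molecule."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Unparsable SMILES: {smiles}")
        # doRandom=True makes RDKit emit a random atom ordering each call.
        variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)}
        return sorted(variants)

    def consensus_predict(model, smiles: str, n: int = 10) -> float:
        """Average model outputs over augmented SMILES of the same molecule."""
        preds = [model.predict(s) for s in augment_smiles(smiles, n)]
        return sum(preds) / len(preds)

Because every augmented string encodes the same molecule, the spread of the individual predictions also gives a rough internal consistency check, which is the intuition behind grounding the prognosis in a consensus.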
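The CharNN head over the Transformer embeddings can be illustrated with a TextCNN-style model. The following Keras sketch is not the authors' exact architecture: the sequence length (128), embedding width (64), filter count (100), and kernel sizes are illustrative assumptions.

    # Sketch of a TextCNN-style head over precomputed Transformer SMILES embeddings.
    # All shapes and hyperparameters below are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers

    def build_char_cnn(seq_len: int = 128, emb_dim: int = 64) -> tf.keras.Model:
        inputs = layers.Input(shape=(seq_len, emb_dim))  # encoder states per token
        branches = []
        for k in (1, 2, 3, 4, 5):  # parallel convolutions with different widths
            x = layers.Conv1D(100, kernel_size=k, activation="relu")(inputs)
            branches.append(layers.GlobalMaxPooling1D()(x))  # one value per filter
        x = layers.Concatenate()(branches)
        x = layers.Dropout(0.2)(x)
        outputs = layers.Dense(1)(x)  # single output for a regression task
        return tf.keras.Model(inputs, outputs)

    model = build_char_cnn()
    model.compile(optimizer="adam", loss="mse")

For a classification task, the last layer would instead use a sigmoid or softmax activation with a cross-entropy loss.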