Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction
SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep-learning models over non-augmented baselines. Here, we propose a novel data augmentation method, which we call “Levenshtein augmentation”, that considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: a transformer and a sequence-to-sequence recurrent neural network with attention. When used to train these baseline models, Levenshtein augmentation increased performance over both non-augmented data and data augmented by conventional SMILES randomization. Furthermore, Levenshtein augmentation appears to produce what we define as attentional gain: an enhancement of the underlying network’s ability to recognize molecular motifs.
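The abstract does not spell out how the reactant–product training pairs are constructed. The sketch below illustrates one plausible reading, assuming the augmentation enumerates randomized SMILES of the product and keeps the variant with the smallest Levenshtein (edit) distance to the reactant SMILES, so that shared sub-sequences line up between input and output strings. The function names `levenshtein_augment` and the parameter `n_random` are hypothetical and introduced here only for illustration; only the RDKit calls (`Chem.MolFromSmiles`, `Chem.MolToSmiles` with `doRandom=True`) are real library APIs.

```python
# Hedged sketch of Levenshtein augmentation, NOT the authors' exact procedure:
# pick, among randomized product SMILES, the string closest to the reactant
# SMILES by edit distance, preserving local sub-sequence similarity.
from rdkit import Chem


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def levenshtein_augment(reactant_smiles: str, product_smiles: str,
                        n_random: int = 50) -> str:
    """Hypothetical helper: return the randomized product SMILES that is
    closest (by Levenshtein distance) to the reactant SMILES string."""
    mol = Chem.MolFromSmiles(product_smiles)
    candidates = {Chem.MolToSmiles(mol, doRandom=True, canonical=False)
                  for _ in range(n_random)}
    return min(candidates, key=lambda s: levenshtein(reactant_smiles, s))


# Toy example: esterification of acetic acid with ethanol.
# The selected product SMILES tracks the reactant string as closely as
# the random enumeration allows.
print(levenshtein_augment("CC(=O)O.OCC", "CC(=O)OCC"))
```

Under this assumed reading, the training target keeps the same atom-ordering motifs as the input wherever possible, which is one way the reported attentional gain could arise.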