ChemRxiv

Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction

preprint
Revised and posted on 06.07.2020 by Dean Sumner, Jiazhen He, Amol Thakkar, Ola Engkvist, Esben Jannik Bjerrum

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep-learning models over non-augmented baselines. Here, we propose a novel data augmentation method, "Levenshtein augmentation", which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: a transformer and a sequence-to-sequence recurrent neural network with attention. Levenshtein augmentation increased performance over both non-augmented data and data augmented by conventional SMILES randomization when used to train the baseline models. Furthermore, Levenshtein augmentation appears to produce what we define as attentional gain: an enhancement in the underlying network's ability to recognize molecular motifs.
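The abstract does not spell out how the augmentation is constructed, but the name refers to the Levenshtein (edit) distance, which quantifies how similar two SMILES strings are at the character level. As an illustration only (not the authors' implementation), a minimal Wagner-Fischer computation of that distance between a reactant and product SMILES might look like:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # One-row dynamic-programming form of the Wagner-Fischer algorithm.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical reactant/product pair written as SMILES:
# ethanol "CCO" vs. acetic acid "CC(=O)O". The shared "CCO"
# sub-sequence keeps the edit distance small (4 insertions).
print(levenshtein("CCO", "CC(=O)O"))  # prints 4
```

Since randomized SMILES enumerate many equivalent strings for the same molecule, a distance like this could, under the paper's premise, be used to prefer reactant/product string pairs that share local sub-sequences; the exact pairing criterion is described in the full preprint.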

Funding

European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 676434, “Big Data in Chemistry” (“BIGCHEM,” http://bigchem.eu).

History

Email Address of Submitting Author

esben.bjerrum@astrazeneca.com

Institution

AstraZeneca

Country

Sweden

ORCID For Submitting Author

0000-0003-1614-7376

Declaration of Conflict of Interest

No conflicts of interest declared
