ChemRxiv
These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.

SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

preprint
submitted on 20.05.2020 and posted on 21.05.2020 by Xinhao Li, Denis Fourches
SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for deep learning models. As a result, SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and property prediction tasks. In the molecular generation task, SPE can boost the validity and novelty of generated SMILES. For molecular property prediction, models were evaluated on 24 benchmark datasets, where SPE consistently matched or outperformed atom-level tokenization. SPE could therefore be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is available at https://github.com/XinhaoLi74/SmilesPE.
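
The two stages described above (learning a vocabulary of frequent substrings, then tokenizing with it) can be illustrated with a minimal sketch in the style of byte pair encoding. This is not the SmilesPE implementation: the atom-level regular expression, the function names (atom_tokenize, learn_merges, spe_tokenize), and the min_frequency parameter are all assumptions made for illustration, and the preprint's actual procedure may differ in its details.

```python
import re
from collections import Counter

# Atom-level SMILES pattern commonly used in the literature.
# Assumption: this is NOT taken from the preprint itself.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of merges from the most frequent token pairs.

    `min_frequency` is a hypothetical stopping threshold for this sketch.
    """
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_frequency:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a SMILES string by replaying the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy corpus for illustration only; a real vocabulary would be learned
# from a large dataset such as ChEMBL.
corpus = ["CCO", "CCN", "CCC"]
merges = learn_merges(corpus, num_merges=1)
tokens = spe_tokenize("CCO", merges)
```

In this toy run the only merge learned is the pair of adjacent carbon atoms, so "CCO" is split into a two-carbon substring token followed by the oxygen atom. Because merged tokens are plain concatenations of atom-level tokens, joining the output always reproduces the input SMILES.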

Email Address of Submitting Author

xli74@ncsu.edu

Institution

North Carolina State University

Country

United States

ORCID For Submitting Author

0000-0002-1821-2680

Declaration of Conflict of Interest

No conflict of interest
