SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for deep learning models. As a result, SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and property prediction tasks. In the molecular generation task, SPE can boost the validity and novelty of generated SMILES. The molecular property prediction models were evaluated on 24 benchmark datasets, where SPE consistently matched or outperformed atom-level tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is available at https://github.com/XinhaoLi74/SmilesPE.
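The two stages described above (learning a vocabulary of frequent substrings, then tokenizing with it) can be illustrated with a minimal byte-pair-encoding-style sketch. This is not the SmilesPE API: the function names, the simplified atom-level regular expression, and the `num_merges` and `min_frequency` parameters are all assumptions made for illustration, and the actual package may differ in its pre-tokenization rules and stopping criteria.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, single letters, digits, symbols.
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/:~@?>*$().]"
)


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_frequency=2):
    """Learn an ordered list of merges from the most frequent adjacent pairs."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, frequency = pair_counts.most_common(1)[0]
        if frequency < min_frequency:
            break  # remaining pairs are too rare to be worth a token
        merges.append(best_pair)
        sequences = [merge_pair(seq, best_pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize by applying the learned merges in the order they were learned."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy corpus standing in for a large dataset such as ChEMBL.
toy_corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)OC", "c1ccccc1C(=O)O"]
toy_merges = learn_merges(toy_corpus, num_merges=5)
print(spe_tokenize("CC(=O)Oc1ccccc1", toy_merges))
```

Because every merge only concatenates adjacent tokens, joining the output tokens always reproduces the input string, and the resulting sequence is never longer than the atom-level one. On a realistic corpus, the learned tokens would tend to correspond to recurring substructures rather than the arbitrary fragments a four-molecule toy corpus yields.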
SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
21 May 2020, Version 1
This content is a preprint and has not undergone peer review at the time of posting.