These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.

Retrosynthesis Prediction using Grammar-based Neural Machine Translation: An Information-Theoretic Approach

submitted on 13.04.2021, 20:25 and posted on 15.04.2021, 07:21 by Vipul Mann, Venkat Venkatasubramanian
Retrosynthetic prediction, one of the main challenges in chemical synthesis, requires identifying reaction pathways and precursor molecules for synthesizing a target molecule. It involves a search over the space of plausible chemical reactions that often results in complex, multi-step, branched synthesis trees, even for moderately complex organic reactions. Here, we propose an approach that performs single-step retrosynthesis prediction using SMILES grammar-based representations in a neural machine translation framework. Information-theoretic analyses reveal that such grammar-based representations are superior to purely character-based representations and well suited to machine learning tasks, owing to their underlying redundancy and high information capacity. We report a top-1 prediction accuracy of 43.8% (top-5 measure of 61.4%) and a syntactic validity of 95.6% (top-5 measure of 91.6%) on a standard reaction dataset. A comparison of our model's performance with previous work that used purely character-based SMILES representations demonstrates improved accuracy and fewer grammatically invalid predictions.
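
To make the information-theoretic terms in the abstract concrete, the following minimal Python sketch computes the empirical Shannon entropy and redundancy of a token sequence. It uses the textbook definitions (entropy in bits per token; redundancy as one minus the ratio of entropy to its maximum for the observed vocabulary) and may differ from the analysis in the paper. The function names, the two example SMILES strings, and the naive character-level tokenization are illustrative assumptions, not taken from the preprint. A grammar-based representation could be compared by passing its production-rule sequence in place of the characters.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Empirical Shannon entropy of a token sequence, in bits per token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(tokens):
    """Redundancy R = 1 - H / H_max, where H_max = log2(vocabulary size).

    Returns 0.0 when fewer than two distinct tokens occur, since H_max
    is then zero and the ratio is undefined (a convention chosen here).
    """
    vocab_size = len(set(tokens))
    if vocab_size < 2:
        return 0.0
    return 1.0 - shannon_entropy(tokens) / math.log2(vocab_size)


# Hypothetical example: ethanol and acetic acid, split naively into characters.
# Real character-level SMILES tokenizers usually treat "Cl", "Br", and
# bracketed atoms as single tokens; that is omitted here for brevity.
smiles = ["CCO", "CC(=O)O"]
char_tokens = [ch for s in smiles for ch in s]

print(f"entropy    = {shannon_entropy(char_tokens):.3f} bits/token")
print(f"redundancy = {redundancy(char_tokens):.3f}")
```

Under these definitions, a more redundant representation has per-token entropy further below the maximum its vocabulary allows. Whether this corresponds exactly to the redundancy and information capacity measures used by the authors is not stated in the abstract.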


Center for the Management of Systemic Risk (CMSR), Columbia University, New York


Columbia University


United States

Declaration of Conflict of Interest

The authors declare no conflict of interest.