Improve retrosynthesis planning with a molecular editing language

Jiacheng Xiong; Wei Zhang; Zunyun Fu; Jiatao Huang; Xiangtai Kong; Yitian Wang; Zhaoping Xiong; Mingyue Zheng

doi:10.26434/chemrxiv-2023-bxhk8

Organic Chemistry

26 December 2023, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Retrosynthetic analysis is a fundamental strategy in organic synthesis, and many computational methods have been developed to address it. A widely adopted approach treats retrosynthetic prediction as a sequence-to-sequence (seq2seq) translation task, in which the Simplified Molecular Input Line Entry System (SMILES) string of a product is translated into the SMILES strings of its corresponding reactants. However, sequence-based models that use SMILES face several issues, including limited performance and a lack of interpretability and controllability. In this work, we introduce E-SMILES, a novel chemical language that extends SMILES and is designed specifically for seq2seq retrosynthetic prediction. This language not only documents the static molecular structure but also encodes the editing operations applied to the molecule during retrosynthesis, enabling it to characterize retrosynthetic reactions more effectively. With E-SMILES, seq2seq retrosynthetic models can simulate the stepwise retrosynthetic analysis strategy of chemists, ensuring that atoms are matched between the predicted reactants and the product and yielding more interpretable and controllable predictions. Furthermore, E-SMILES is naturally aligned with the product's SMILES, reducing the edit distance between the model's input and output sequences. This liberates the model from learning the complex SMILES syntax and allows it to focus on the retrosynthesis process itself. Leveraging E-SMILES, our retrosynthesis model achieves top-1 accuracies of 58.9% and 68.5% on the USPTO-50k dataset without and with a given reaction class, respectively, significantly surpassing previous state-of-the-art results. We envisage that E-SMILES can serve as a new foundational tool, promoting the development of sequence-based retrosynthetic prediction methods.
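
The edit distance between a model's input and output sequences, mentioned in the abstract, can be made concrete with a minimal sketch. The code below computes a character-level Levenshtein distance between a product SMILES and a reactant SMILES. Everything here is an assumption for illustration: the function name `levenshtein`, the character-level granularity (the paper may measure distance over tokens or in another way), and the example esterification are not taken from the paper. E-SMILES strings are not shown because this page does not define their syntax.

```python
def levenshtein(source: str, target: str) -> int:
    """Character-level Levenshtein distance using a rolling one-row DP table."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting all i characters of the source prefix
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Illustrative example (not from the paper): methyl acetate as the product,
# acetic acid and methanol as the reactants.
product = "CC(=O)OC"
reactants = "CC(=O)O.CO"
print(levenshtein(product, reactants))
```

A smaller distance means the output sequence can be reached from the input with fewer insertions, deletions, and substitutions, which is the sense in which an output representation aligned with the product's SMILES would be easier for a seq2seq model to produce.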

Keywords

Retrosynthesis planning

Model interpretability

Supplementary materials

Supplementary Information: Supplementary Information for Improve retrosynthesis planning with a molecular editing language

Version History

Dec 26, 2023 Version 1

Metrics

Views: 2,513

Downloads: 1,373

License

The content is available under CC BY-NC-ND 4.0

Funding

National Natural Science Foundation of China: T2225002

National Natural Science Foundation of China: 82273855

National Key Research and Development Program of China: 2022YFC3400504

National Key Research and Development Program of China: 2023YFC2305904

SIMM-SHUTCM Traditional Chinese Medicine Innovation Joint Research Program: E2G805H

Open Fund of the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, China: KF-202301

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
