Unbiasing Retrosynthesis Language Models with Disconnection Prompts

20 September 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Data-driven approaches to retrosynthesis have thus far been limited in user interaction, in the diversity of their predictions, and by their recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule, we can steer the model to propose a wider set of precursors, overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them back greater control over the disconnection predictions, resulting in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a schema for automatic identification of disconnection sites, followed by prediction of reactant sets, achieving a 100% improvement in class diversity compared to the baseline. The approach is effective in mitigating prediction biases deriving from training data. In turn, this provides a larger variety of usable building blocks, which improves the end-user digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is key.
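As a rough illustration of what "a prompt describing the disconnection site" could look like at the string level, the sketch below marks chosen atoms of a product SMILES with an atom-map label before the string would be passed to a model. The tagging scheme (`:1` on the selected atoms), the simplified tokenizer, and the function names `tokenize` and `tag_disconnection` are assumptions made for this example only and are not taken from the abstract. It also assumes an input SMILES without existing atom maps. Note that wrapping an organic-subset atom in brackets drops its implicit hydrogens under SMILES rules, so a real pipeline would use a cheminformatics toolkit rather than string edits.

```python
import re

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
# Tokens that count as atoms when numbering positions in the string.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def tag_disconnection(smiles, atom_indices):
    """Mark the atoms at the given 0-based positions with a ':1' label.

    Hypothetical prompt format: tagged atoms indicate where the molecule
    should be disconnected. This is a string-level toy; bracketing a bare
    atom changes its implicit hydrogen count in real SMILES semantics.
    """
    tagged = []
    atom_idx = 0
    for tok in tokenize(smiles):
        if ATOM_RE.fullmatch(tok):
            if atom_idx in atom_indices:
                if tok.startswith("["):
                    tok = tok[:-1] + ":1]"   # existing bracket atom
                else:
                    tok = "[" + tok + ":1]"  # bare atom gets brackets
            atom_idx += 1
        tagged.append(tok)
    return "".join(tagged)


# Tag the carbonyl carbon and the nitrogen of N-methylacetamide,
# i.e. the amide C-N bond as the intended disconnection site.
prompt = tag_disconnection("CC(=O)NC", {1, 3})
print(prompt)  # C[C:1](=O)[N:1]C
```

Under this scheme, a chemist (or an automatic site-identification step) would choose the atom positions, and the tagged string would serve as the model input in place of the plain product SMILES.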


Reaction Informatics
Deep Learning
Artificial Intelligence
Prompt-based modeling
Interactive Models
Data Science
Organic Chemistry

Supplementary materials

Supporting Information: Unbiasing Retrosynthesis Language Models with Disconnection Prompts
Supporting information containing extended results and methods.
