Unbiasing Retrosynthesis Language Models with Disconnection Prompts



Data-driven approaches to retrosynthesis have thus far been limited in user interaction and in the diversity of their predictions, and by their recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule, we can steer the model to propose a wider set of precursors, overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them back greater control over the disconnection predictions, resulting in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a schema for automatic identification of disconnection sites, followed by prediction of reactant sets, achieving a 100% improvement in class diversity compared to the baseline. The approach is effective in mitigating prediction biases deriving from training data. In turn, this provides a larger variety of usable building blocks, which improves the end-user digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is key.


Supplementary material

Supporting Information: Unbiasing Retrosynthesis Language Models with Disconnection Prompts
Supporting information containing extended results and methods.

Supplementary weblinks

Code: Disconnection Aware Retrosynthesis
Code used to process data and train models for disconnection-aware retrosynthesis.