Inferring Experimental Procedures from Text-Based Representations of Chemical Reactions
Preprint submitted on 20.10.2020, 15:25 and posted on 21.10.2020, 09:01 by Alain C. Vaucher, Philippe Schwaller, Joppe Geluykens, Vishnu H Nair, Anna Iuliano, Teodoro Laino
The experimental execution of chemical reactions is a context-dependent and time-consuming process, often accomplished by drawing on experience collected over multiple decades of laboratory work or by searching for similar, previously executed experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, the conversion of proposed synthetic routes to experimental procedures remains a burden on the shoulders of domain experts. In this work, we present, for the first time, data-driven models for predicting the entire sequence of synthesis steps starting from a textual representation of a chemical equation. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the resulting data set to train three different models: a nearest-neighbor model based on recently introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. When evaluated on the ground truth data, the best-performing model (Transformer) achieves an accuracy of 72.7% for single action predictions, and a 100% match of the full action sequence for 3.6% of experimental procedures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.
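To make the nearest-neighbor baseline and the full-sequence match metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes that reaction fingerprints are plain numeric vectors, that similarity is measured by cosine similarity, and that a "100% match" means exact equality of two action sequences. The fingerprint values, the action names (`ADD`, `STIR`, and so on), and all function names are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_actions(query_fp, train_fps, train_actions):
    """Return the action sequence of the most similar training reaction."""
    best_index = max(
        range(len(train_fps)),
        key=lambda i: cosine_similarity(query_fp, train_fps[i]),
    )
    return train_actions[best_index]


def full_match_rate(predicted, reference):
    """Fraction of procedures whose predicted sequence equals the reference.

    Assumption: a full match is exact list equality.
    """
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


# Toy data (hypothetical): two training reactions and two test reactions.
train_fps = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
train_actions = [
    ["ADD", "STIR", "FILTER"],
    ["ADD", "REFLUX", "CONCENTRATE"],
]
query_fps = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.0]]
reference = [
    ["ADD", "STIR", "FILTER"],
    ["ADD", "STIR", "CONCENTRATE"],
]

predictions = [
    nearest_neighbor_actions(fp, train_fps, train_actions) for fp in query_fps
]
rate = full_match_rate(predictions, reference)
print(predictions)
print(rate)
```

In this toy setup the first query retrieves a procedure identical to its reference, while the second retrieves one that differs in a single action, so only one of the two predictions counts as a full match. This illustrates why full-sequence agreement can be much lower than per-action accuracy.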