Data-Driven Chemical Reaction Classification with Attention-Based Neural Networks

Organic reactions are usually grouped into classes of transformations that undergo similar structural rearrangements. Classifying a reaction is tedious: it first requires an accurate mapping of the rearrangement (atom mapping), followed by identification of the corresponding reaction class template. In this work, we present a transformer-based model that infers reaction classes from the SMILES representation of chemical reactions. The model reaches an accuracy of 93.8% on a multi-class classification task involving several hundred different classes. The attention weights of the model, learned solely from data, give insight into which parts of the SMILES strings are taken into account for classification. We study the incorrect predictions of our model and show that this analysis uncovers several biases and mistakes in the underlying data set.