Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels–Alder Reaction Outcomes

06 March 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Applying machine learning models to reaction outcome prediction currently requires large and/or highly featurized datasets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, predicts the outcomes of Diels–Alder reactions with high accuracy from a relatively small training set, with no pretraining and no additional features. We establish a diverse dataset of 9,537 intramolecular, hetero-, aromatic, and inverse electron demand Diels–Alder reactions. We use this dataset to train a NERF model and compare its performance against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy of NERF (regio- and site selectivity in the major product) exceeds 90% when as little as 40% of the dataset is used for training. Another high-performing model, Chemformer, requires both pretraining and a larger share of the dataset for training (>45%) to reach 90% Top-1 accuracy. Accurate predictions for less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show that NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit because the model can be rapidly expanded and tailored to areas of chemistry that fall in the low-data regime.
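The evaluation described above, training on a fixed percentage of the dataset and scoring Top-1 accuracy on the remainder, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `train_fraction_split` and `top1_accuracy`, the random split with a fixed seed, and the exact string comparison of product representations (which assumes products have already been canonicalized) are all assumptions made for illustration.

```python
import random


def train_fraction_split(reactions, fraction, seed=0):
    """Shuffle the reactions and return (train, test), with `fraction` used for training.

    Hypothetical helper: the preprint's actual splitting protocol is not
    described in the abstract, so a seeded random split is assumed here.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def top1_accuracy(predicted, recorded):
    """Fraction of reactions whose top-ranked prediction matches the recorded major product.

    Assumes both sequences hold canonicalized product strings, so that
    equal products compare equal as plain strings.
    """
    if len(predicted) != len(recorded):
        raise ValueError("predicted and recorded must have the same length")
    if not recorded:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, recorded))
    return hits / len(recorded)


# Placeholder identifiers standing in for the 9,537 reactions in the dataset.
reaction_ids = range(9537)
train, test = train_fraction_split(reaction_ids, 0.40)
```

With a 40% training fraction, the split leaves roughly 3,800 reactions for training and the rest for evaluation. Repeating the split at several fractions would give the kind of accuracy-versus-training-size comparison the abstract reports.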

Keywords

Diels–Alder
Product Prediction
Reaction Prediction
Graph Neural Networks
Natural Language Processors

Supplementary materials

Supporting Information
Contains additional figures, explanations, Reaxys reaction IDs, and a link to a GitHub repository containing a Jupyter notebook to regenerate the dataset.
