Abstract
The application of machine learning (ML) techniques to model high-throughput experimentation (HTE) datasets has seen a recent rise in popularity. Nevertheless, the ability to model the interplay between reaction components, known as interaction effects, with ML remains an outstanding challenge. Using a simulated HTE dataset, we find that the presence of irrelevant features poses a strong obstacle to learning interaction effects with common ML algorithms. To address this problem, we propose a two-part statistical modeling approach for HTE datasets: classical analysis of variance (ANOVA) of the experiment to identify systematic effects that impact reaction yield across the experiment, followed by regression of individual effects using chemistry-informed features. To illustrate this methodology, we use our previously published alcohol deoxyfluorination dataset comprising 740 reactions to build compact, interpretable regression models that account for each significant effect observed in the dataset. We achieve a sizeable performance boost compared to our previously published Random Forest model, reducing mean absolute error (MAE) from 18.1% to 13.4% and root mean squared error (RMSE) from 21.7% to 16.5% on a newly generated test set. Finally, we demonstrate that this approach can facilitate the generation of new mechanistic hypotheses which, when probed experimentally, can lead to a deeper understanding of chemical reactivity.
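As a rough illustration of the first step described above (ANOVA to detect main and interaction effects before any regression), the following is a minimal sketch of a balanced two-way ANOVA in plain Python. It is not the authors' code: the factor labels (`base_*`, `reagent_*`), the yield values, and the function name `two_way_anova` are all hypothetical, and the data are synthetic, not taken from the deoxyfluorination dataset.

```python
from itertools import product


def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction.

    `cells` maps (level_a, level_b) -> list of replicate yields.
    Every combination must be present with the same number (>= 2) of replicates.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if n < 2 or any(len(ys) != n for ys in cells.values()):
        raise ValueError("need a balanced design with at least 2 replicates")
    if len(cells) != len(levels_a) * len(levels_b):
        raise ValueError("every factor combination must be present")

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    grand = mean(y for ys in cells.values() for y in ys)
    mean_a = {a: mean(y for b in levels_b for y in cells[(a, b)]) for a in levels_a}
    mean_b = {b: mean(y for a in levels_a for y in cells[(a, b)]) for b in levels_b}
    mean_ab = {key: mean(ys) for key, ys in cells.items()}

    # Sums of squares for the two main effects, their interaction, and residual error.
    ss = {
        "A": n * len(levels_b) * sum((m - grand) ** 2 for m in mean_a.values()),
        "B": n * len(levels_a) * sum((m - grand) ** 2 for m in mean_b.values()),
        "A:B": n * sum(
            (mean_ab[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
            for a, b in product(levels_a, levels_b)
        ),
        "error": sum((y - mean_ab[key]) ** 2 for key, ys in cells.items() for y in ys),
    }
    ss_total = sum((y - grand) ** 2 for ys in cells.values() for y in ys)

    df = {
        "A": len(levels_a) - 1,
        "B": len(levels_b) - 1,
        "A:B": (len(levels_a) - 1) * (len(levels_b) - 1),
        "error": len(levels_a) * len(levels_b) * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_stat = {k: (ss[k] / df[k]) / ms_error for k in ("A", "B", "A:B")}
    return {"ss": ss, "ss_total": ss_total, "df": df, "F": f_stat}


# Synthetic cell means (% yield); one combination is deliberately poor,
# which builds an interaction effect into the toy data.
cell_means = {
    ("base_1", "reagent_1"): 40.0,
    ("base_1", "reagent_2"): 50.0,
    ("base_1", "reagent_3"): 60.0,
    ("base_2", "reagent_1"): 50.0,
    ("base_2", "reagent_2"): 60.0,
    ("base_2", "reagent_3"): 30.0,
}
# Two replicates per cell with a fixed +/-1 % perturbation standing in for noise.
cells = {key: [m + 1.0, m - 1.0] for key, m in cell_means.items()}

result = two_way_anova(cells)
for effect, f_value in result["F"].items():
    print(f"{effect}: F = {f_value:.1f}")
```

In this toy table the `A:B` term carries the largest F statistic by construction. In a real analysis each F statistic would be compared against the F distribution to obtain a p-value, and only effects judged significant would be carried forward to the regression step.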
Supplementary materials
Supporting Information: Experimental procedures, experimental data, and characterization and spectral data (PDF)