Roadmap to Pharmaceutically Relevant Reactivity Models Leveraging High-Throughput Experimentation

19 September 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


The merger of high-throughput experimentation (HTE) and data science presents an opportunity to both accelerate and inspire innovations in synthetic chemistry. Similarly, developments in machine learning (ML) have enabled the distillation of large and complex datasets into predictive models capable of generalizing patterns in the data. However, efforts to merge HTE with ML remain constrained by the small number of reported datasets, which have limited structural diversity, and by corresponding trained models that do not extrapolate well to substrates beyond the training set. Herein, we detail the first ML models for Pd-catalyzed C–N couplings built on large, structurally diverse, pharmaceutically relevant datasets (~5000 unique products) generated using chemistry compatible with the nanomole scale. Careful consideration is given to both the diversity of the dataset and the accuracy of model predictions for substrates bearing features beyond those present in the training set. The structural diversity of the dataset is enabled by leveraging the Merck & Co., Inc. Building Block Collection, with an initial focus on C–N coupling using secondary amines. The size of the dataset enables the systematic evaluation of model performance using five different data-splitting strategies, each carefully designed to evaluate the model's ability to extrapolate beyond the substrates in the training set. Classification models, built with a lens toward application in medicinal chemistry campaigns, exceeded the baseline precision-recall by 25–67%, depending on the splitting strategy. In practice, these results would manifest as a significant enrichment of successful C–N couplings among the hits recommended by the models. In addition, the accuracy of the best model for each of the five splits ranges from 70% to 87%, suggesting excellent overall predictivity, even for completely unseen substrates.
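As a rough illustration of the evaluation idea described in the abstract, the sketch below holds out every reaction involving a chosen amine so that the test set contains only an unseen substrate, then compares the precision of a model's recommended hits against the positive rate of the test set (the usual baseline of a precision-recall curve for a random classifier). Everything here is an assumption made for illustration: the `Reaction` record, the substrate labels, the toy outcomes, the hard-coded predictions, and the choice of an amine-based holdout. The abstract does not specify the five splits or the exact baseline definition, so this should not be read as the authors' procedure.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    """Hypothetical record for one C-N coupling experiment."""
    aryl_halide: str
    amine: str
    success: bool


def split_by_held_out_amines(reactions, held_out_amines):
    """Put every reaction that uses a held-out amine into the test set."""
    held = set(held_out_amines)
    train = [r for r in reactions if r.amine not in held]
    test = [r for r in reactions if r.amine in held]
    return train, test


def baseline_precision(labels):
    """Fraction of successful reactions: precision of recommending at random."""
    return sum(labels) / len(labels)


def precision(labels, predicted):
    """Fraction of successes among the reactions the model recommends."""
    recommended = [y for y, p in zip(labels, predicted) if p]
    if not recommended:
        return 0.0
    return sum(recommended) / len(recommended)


# Toy data, invented for this sketch only.
reactions = [
    Reaction("ArX-1", "amine-A", True),
    Reaction("ArX-2", "amine-A", False),
    Reaction("ArX-1", "amine-B", True),
    Reaction("ArX-2", "amine-B", True),
    Reaction("ArX-1", "amine-C", False),
    Reaction("ArX-2", "amine-C", True),
    Reaction("ArX-3", "amine-C", False),
    Reaction("ArX-4", "amine-C", False),
]

# amine-C never appears in training, so the test set probes extrapolation.
train, test = split_by_held_out_amines(reactions, ["amine-C"])
test_labels = [r.success for r in test]

# Stand-in for the output of a trained classifier on the test reactions.
predicted = [True, True, False, False]

base = baseline_precision(test_labels)
prec = precision(test_labels, predicted)
enrichment = prec - base
print(f"baseline={base:.2f} precision={prec:.2f} enrichment={enrichment:.2f}")
```

Holding out whole substrates, rather than random rows, keeps reactions of the same amine from appearing on both sides of the split, which is one common way to test performance on unseen substrates.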


High-Throughput Experimentation
Machine Learning
palladium-catalyzed C–N cross-coupling
aryl halides
secondary amines

Supplementary materials

Supplementary material
Supporting information on experimental and modeling details referred to in the manuscript text

