Catalysis

Roadmap to Pharmaceutically Relevant Reactivity Models Leveraging High-Throughput Experimentation

Authors

  • Jessica Xu, Massachusetts Institute of Technology, Cambridge, MA, USA
  • Dipannita Kalyani, Department of Discovery Chemistry, Merck & Co., Inc., Kenilworth, NJ 07033, USA
  • Thomas Struble, Department of Discovery Chemistry, Merck & Co., Inc., Kenilworth, NJ 07033, USA
  • Spencer Dreher, Department of Discovery Chemistry, Merck & Co., Inc., Kenilworth
  • Shane Krska, Department of Discovery Chemistry, Merck & Co., Inc., Kenilworth, NJ 07033, USA
  • Stephen L. Buchwald, Massachusetts Institute of Technology, Cambridge, MA, USA
  • Klavs F. Jensen, Massachusetts Institute of Technology, Cambridge, MA, USA

Abstract

The merger of high-throughput experimentation (HTE) and data science presents an opportunity to both accelerate and inspire innovations in synthetic chemistry. Similarly, developments in machine learning (ML) have enabled the distillation of large and complex data sets into predictive models capable of generalizing patterns in the data. However, efforts to merge HTE with ML remain constrained by the small number of reported data sets, which have limited structural diversity, and by the corresponding trained models, which do not extrapolate well to substrates beyond the training set. Herein, we detail the first ML models for Pd-catalyzed C–N couplings built on large, structurally diverse, pharmaceutically relevant data sets (~5000 unique products) generated using chemistry compatible with the nanomole scale. Careful consideration is given both to the diversity of the data set and to the accuracy of model predictions for substrates bearing features beyond those present in the training set. The structural diversity of the data set is enabled by leveraging the Merck & Co., Inc. Building Block Collection, with an initial focus on C–N coupling using secondary amines. The size of the data set enables the systematic evaluation of model performance using five different data-splitting strategies, each carefully designed to evaluate the model's ability to extrapolate beyond the substrates in the training set. Classification models built with a lens toward application in medicinal chemistry campaigns exceeded the precision-recall baseline by 25–67%, depending on the splitting strategy. These results would manifest as significant enrichment of successful C–N couplings among the hits recommended by the models. In addition, the accuracy of the best model for each of the five splits ranges from 70% to 87%, suggesting excellent overall predictivity of the models even for completely unseen substrates.
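
To make the idea of an extrapolative data split concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the five splitting strategies, so the example assumes one plausible variant: every reaction of a held-out amine is placed in the test set, so that no test amine is seen during training. The record layout `(amine_id, halide_id, success)`, the function names, and the toy data are all hypothetical. The sketch also computes the usual precision-recall baseline, namely the fraction of successful reactions, which is the precision obtained by labeling every reaction a success.

```python
import random
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the paper's schema):
# (amine_id, halide_id, reaction_succeeded)
Reaction = Tuple[str, str, bool]


def split_by_unseen_amine(
    reactions: List[Reaction], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Reaction], List[Reaction]]:
    """Hold out whole amines so that no test amine appears in training."""
    amines = sorted({r[0] for r in reactions})
    rng = random.Random(seed)
    rng.shuffle(amines)
    n_test = max(1, round(len(amines) * test_fraction))
    test_amines = set(amines[:n_test])
    train = [r for r in reactions if r[0] not in test_amines]
    test = [r for r in reactions if r[0] in test_amines]
    return train, test


def baseline_precision(reactions: List[Reaction]) -> float:
    """Positive rate: the precision of labeling every reaction a success."""
    return sum(1 for r in reactions if r[2]) / len(reactions)


# Toy data only: 10 amines x 7 halides, with an arbitrary success pattern.
reactions = [
    (f"amine_{i % 10}", f"halide_{i % 7}", i % 3 == 0) for i in range(70)
]
train, test = split_by_unseen_amine(reactions)
print(len(train), len(test), round(baseline_precision(test), 2))
```

A model's precision on the held-out reactions can then be compared against `baseline_precision(test)`; a value above that baseline corresponds to the enrichment of successful couplings described above. Holding out whole substrates rather than individual reactions is what makes the comparison a test of extrapolation.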

Supplementary material

Supporting information on experimental and modeling details referred to in the manuscript text