Datasets and Their Influence on the Development of Computer Assisted Synthesis Planning Tools in the Pharmaceutical Domain

27 September 2019, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Computer Assisted Synthesis Planning (CASP) has recently gained considerable interest. Herein we investigate a template-based retrosynthetic planning tool trained on a variety of datasets consisting of up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN) and the publicly available United States Patent Office (USPTO) extracts are sufficient for the prediction of full synthetic routes to compounds of interest in medicinal chemistry. We assessed the models on 1,731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessment of the ‘filter network’, and propose that the number of successfully applied templates, in conjunction with the overall ability to generate full synthetic routes, be examined instead. We found that the specificity of the templates comes at the cost of generalizability and overall model performance. This is supplemented by a comparison of the underlying datasets and their corresponding models.

Keywords

retrosynthetic analysis
retrosynthetic analyses
Machine Learning
Computer Aided Synthesis
Retrosynthetic Prediction
Chemistry Data
Chemistry
Reaction data
Organic Syntheses
drug discovery applications
synthesis planning tools
synthesis planning
predictive models

Supplementary materials

Thakkar CASP and dataset performance supplementary
Group SMARTS
Top125 Pharmaceuticals 2018
CHEMBL 20000 sample
Datasets Top10 Templates
