Theoretical and Computational Chemistry

Datasets and Their Influence on the Development of Computer Assisted Synthesis Planning Tools in the Pharmaceutical Domain


Computer Assisted Synthesis Planning (CASP) has recently gained considerable interest. Herein we investigate a template-based retrosynthetic planning tool trained on a variety of datasets consisting of up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN) and the publicly available United States Patent and Trademark Office (USPTO) extracts are sufficient for the prediction of full synthetic routes to compounds of interest in medicinal chemistry. We assessed the models on 1,731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessing the ‘filter network’, and propose that the number of successfully applied templates, in conjunction with the overall ability to generate full synthetic routes, be examined instead. We found that the specificity of the templates comes at the cost of generalizability and overall model performance. This is supplemented by a comparison of the underlying datasets and their corresponding models.

Version notes

initial version



Supplementary material

Thakkar_CASP_and_dataset_performance_supplementary.pdf
Group_SMARTS.txt
Top125_Pharmaceuticals_2018.txt
CHEMBL_20000_sample.txt
Datasets_Top10_Templates.txt