Datasets and Their Influence on the Development of Computer Assisted Synthesis Planning Tools in the Pharmaceutical Domain

Amol Thakkar; Thierry Kogej; Jean-Louis Reymond; Ola Engkvist; Esben Jannik Bjerrum

doi:10.26434/chemrxiv.9897692.v1

Theoretical and Computational Chemistry

27 September 2019, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Computer Assisted Synthesis Planning (CASP) has attracted considerable interest recently. Here we investigate a template-based retrosynthetic planning tool trained on a variety of datasets comprising up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN) and the publicly available United States Patent and Trademark Office (USPTO) extracts are sufficient to predict full synthetic routes to compounds of interest in medicinal chemistry. We assessed the models on 1,731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessing the ‘filter network’, and propose that the number of successfully applied templates, together with the overall ability to generate full synthetic routes, be examined instead. We found that template specificity comes at the cost of generalizability and overall model performance. These findings are supplemented by a comparison of the underlying datasets and their corresponding models.
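
The contrast the abstract draws between accuracy and the number of successfully applied templates can be illustrated with a minimal sketch. Everything below is a toy assumption and not the authors' implementation: the template identifiers, the molecule names, the `applicable` lookup, and the helper names `top_k_accuracy` and `applied_in_top_k` are hypothetical. In a real pipeline the applicability check would run each reaction template against the target molecule with a cheminformatics toolkit.

```python
from typing import Callable, List, Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of targets whose recorded template is among the top-k predictions."""
    hits = sum(1 for preds, t in zip(ranked, truth) if t in preds[:k])
    return hits / len(truth)


def applied_in_top_k(
    ranked: Sequence[Sequence[str]],
    targets: Sequence[str],
    applies: Callable[[str, str], bool],
    k: int,
) -> List[int]:
    """For each target, count how many of the top-k templates can actually be applied."""
    return [
        sum(1 for tpl in preds[:k] if applies(tpl, mol))
        for preds, mol in zip(ranked, targets)
    ]


# Toy data (hypothetical): two targets, the template recorded for each,
# and the model's ranked template predictions.
targets = ["mol_a", "mol_b"]
truth = ["t1", "t4"]
ranked = [["t1", "t2", "t3"], ["t5", "t6", "t4"]]

# Hypothetical lookup of (template, molecule) pairs that yield valid precursors.
applicable = {("t1", "mol_a"), ("t5", "mol_b"), ("t6", "mol_b"), ("t4", "mol_b")}


def applies(template: str, molecule: str) -> bool:
    return (template, molecule) in applicable


top1 = top_k_accuracy(ranked, truth, k=1)
applied = applied_in_top_k(ranked, targets, applies, k=3)
print(top1, applied)
```

In this toy case the top-1 prediction for `mol_b` counts as a miss, yet all three of its top-ranked templates are applicable, whereas `mol_a` is a top-1 hit with only one applicable template. Accuracy alone therefore says little about how many usable disconnections the model offers for a given target.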

Keywords

retrosynthetic analysis

retrosynthetic analyses

Machine Learning

Computer Aided Synthesis

Retrosynthetic Prediction

drug discovery applications

synthesis planning tools

synthesis planning

predictive models

Supplementary materials

Thakkar CASP and dataset performance supplementary

Group SMARTS

Top125 Pharmaceuticals 2018

CHEMBL 20000 sample

Datasets Top10 Templates

Now Published

Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain

Amol Thakkar, Thierry Kogej, Jean-Louis Reymond, Ola Engkvist, Esben Jannik Bjerrum

Journal article

Chemical Science, Volume 11, Issue 1

Online publication date: 2020

Version History

Sep 27, 2019 Version 1

Version Notes

initial version

Metrics

Views: 5,010

Downloads: 1,296

License

The content is available under CC BY 4.0

DOI

10.26434/chemrxiv.9897692.v1

Funding

Amol Thakkar is supported financially by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 676434, “Big Data in Chemistry” (“BIGCHEM,” http://bigchem.eu).

Author’s competing interest statement

No conflict of interest.
