ORDerly: Datasets and benchmarks for chemical reaction data

Daniel Wigh; Joe Arrowsmith; Alexander Pomberger; Kobi Felton; Alexei Lapkin

doi:10.26434/chemrxiv-2023-qkjtb-v2

Theoretical and Computational Chemistry

30 August 2023, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Machine learning has the potential to provide tremendous value to the life sciences through models that aid the discovery of new molecules and reduce the time it takes new products to reach the market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction datasets for training ML models. Herein, we present ORDerly, an open-source Python package for customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean US patent data stored in ORD and to generate datasets for forward prediction and retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks for condition prediction on datasets generated with ORDerly and show that datasets missing key cleaning steps can lead to silently inflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalisation. By providing a customizable open-source solution for cleaning and preparing large chemical reaction datasets, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
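
One generic way a missing cleaning step can inflate a metric is when the same reaction appears on both sides of a train/test split. The abstract does not say which steps matter most, so the following is only a minimal illustrative sketch under that assumption, not ORDerly's pipeline or API. The helper names `canonical_key` and `dedupe_and_split` are hypothetical, and the key only reorders dot-separated components; a real pipeline would also canonicalise each molecule with a cheminformatics toolkit.

```python
import random


def canonical_key(reaction_smiles: str) -> str:
    """Order-insensitive key for a 'reactants>agents>products' string.

    Assumption: components are separated by '.' and sides by '>'.
    Molecules themselves are NOT canonicalised here.
    """
    parts = reaction_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '>'-separated fields, got {len(parts)}")
    # Sort the components on each side so that A.B>>C and B.A>>C match.
    return ">".join(".".join(sorted(p.split("."))) for p in parts)


def dedupe_and_split(reactions, test_fraction=0.2, seed=0):
    """Drop duplicate reactions, then make a random train/test split."""
    unique = {}
    for rxn in reactions:
        # Keep the first occurrence of each distinct reaction.
        unique.setdefault(canonical_key(rxn), rxn)
    keys = sorted(unique)                # fixed order before shuffling
    random.Random(seed).shuffle(keys)    # reproducible shuffle
    n_test = int(len(keys) * test_fraction)
    test = [unique[k] for k in keys[:n_test]]
    train = [unique[k] for k in keys[n_test:]]
    return train, test


reactions = [
    "CCO.CC(=O)O>>CC(=O)OCC",
    "CC(=O)O.CCO>>CC(=O)OCC",  # same reaction, reactants reordered
    "C=C.Br>>CCBr",
    "CCBr.O>>CCO",
    "CC=O.N>>CC=N",
    "c1ccccc1.Cl>>Clc1ccccc1",
]
train, test = dedupe_and_split(reactions)
```

Splitting after deduplication means no reaction key can occur in both sets; splitting first and deduplicating each set separately would not give that guarantee.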

Supplementary materials

ORDerly: Supplementary Information

A: ORDerly Datasheet
B: Dataset extraction and cleaning methodology
C: Further experimental details (training ML models)
D: ORDerly benchmark statistics
E: Example reaction instances and predictions

Supplementary weblinks

ORDerly GitHub Repository

ORDerly source code.

ORDerly benchmark datasets

ORDerly benchmark datasets for reaction condition prediction, forward prediction, and retrosynthesis.

Version History

Aug 30, 2023 Version 2

Aug 03, 2023 Version 1

Version Notes

We use ORDerly to generate datasets for forward prediction and retrosynthesis, and train the Molecular Transformer on these datasets. We evaluate on a randomly split held-out test set and also on non-USPTO data.

License

The content is available under CC BY 4.0

Funding

UCB Pharma

Innovation Centre in Digital Molecular Technologies

EPSRC Centre for Doctoral Training in Automated Chemical Synthesis Enabled by Digital Molecular Technologies

EP/S024220/1

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
