Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Mingjian Wen; Samuel M. Blau; Xiaowei Xie; Shyam Dwaraknath; Kristin A. Persson

doi:10.26434/chemrxiv-2021-xr8tf-v2

Physical Chemistry

11 January 2022, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data; otherwise they overfit and suffer from low accuracy and poor transferability. In this work, we propose a strategy that leverages unlabelled data to learn accurate ML models from small labelled chemical reaction datasets. We focus on an old and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find that they are key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer-learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets. The learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
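The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not name the loss function, so the example below assumes an NT-Xent loss (the normalized temperature-scaled cross entropy popularized by SimCLR), which is one common way to pull two augmented views of the same item together while pushing other items apart. The function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from the GNN applied to augmented reactions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings (assumed loss, not
    confirmed by the source).

    view_a[i] and view_b[i] are embeddings of two augmented versions of
    reaction i. Every other embedding in the batch acts as a negative.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)  # 2n embeddings in one pool
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        # Similarities of i to every other embedding, scaled by temperature.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Toy batch: two reactions, each with two augmented views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]    # views agree with their partners
view_b_shuffled = [[0.1, 0.9], [0.9, 0.1]]   # views paired with the wrong reaction

loss_aligned = nt_xent_loss(view_a, view_b_aligned)
loss_shuffled = nt_xent_loss(view_a, view_b_shuffled)
```

The loss is lower when each pair of views has similar embeddings and higher when the views are mismatched, which is the signal the pretraining stage optimizes before fine-tuning on labelled reactions.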

Keywords

machine learning

graph neural networks

unsupervised learning

contrastive learning

reaction classification

chemical reactions

reaction fingerprints

Supplementary weblinks

Title: RxnRep GitHub repository

Description: Code for training the models and for using the pretrained models to generate reaction fingerprints.

Now Published

Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Mingjian Wen, Samuel M. Blau, Xiaowei Xie, Shyam Dwaraknath, Kristin A. Persson (journal article)

Chemical Science, Volume 13, Issue 5

Online publication date: 2022

Version History

Jan 11, 2022 Version 2

Nov 23, 2021 Version 1

Version Notes

Add new tests that use existing reaction fingerprints.

Metrics

Views: 1,722

Downloads: 801

License

The content is available under CC BY NC 4.0

DOI

10.26434/chemrxiv-2021-xr8tf-v2

Funding

US Department of Energy

DE-AC02-05CH11231

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
