Data augmentation strategies to improve reaction yield predictions and estimate uncertainty

Philippe Schwaller; Alain C. Vaucher; Teodoro Laino; Jean-Louis Reymond

doi:10.26434/chemrxiv.13286741.v1

Chemical reactions describe how precursor molecules react together and transform into products. The reaction yield describes the percentage of the precursors successfully transformed into products relative to the theoretical maximum. Predicting reaction yields can help chemists navigate reaction space and accelerate the design of more effective routes. Here, we investigate the best-studied high-throughput experiment data set and show how data augmentation on chemical reactions can improve the accuracy of yield predictions, even when only small data sets are available. Previous work used molecular fingerprints or physics-based or categorical descriptors of the precursors. In this manuscript, we fine-tune natural language processing-inspired reaction transformer models on different augmented data sets to predict yields using solely a text-based representation of chemical reactions. When the random training sets contain 2.5% or more of the data, our models outperform previous models, including those using physics-based descriptors as inputs. Moreover, we demonstrate the use of test-time augmentation to generate uncertainty estimates, which correlate with the prediction errors.
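The idea of test-time augmentation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which augmentations are used, so the sketch assumes one simple choice, shuffling the order of the precursors in a reaction SMILES string. The function names, the `predict_yield` callable, and the example reaction are all hypothetical. The mean over the augmented inputs serves as the prediction and the standard deviation as the uncertainty estimate.

```python
import random
import statistics
from typing import Callable, List, Tuple


def permute_precursors(reaction_smiles: str, n: int, seed: int = 0) -> List[str]:
    """Return n copies of a reaction SMILES with the precursor order shuffled.

    Assumes the form 'precursors>>product' with precursors separated by '.'.
    Note: splitting on '.' also separates multi-fragment species such as salts.
    """
    precursors, product = reaction_smiles.split(">>")
    molecules = precursors.split(".")
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        shuffled = molecules[:]
        rng.shuffle(shuffled)
        variants.append(".".join(shuffled) + ">>" + product)
    return variants


def predict_with_uncertainty(
    reaction_smiles: str,
    predict_yield: Callable[[str], float],  # hypothetical trained yield model
    n_augment: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Predict on augmented inputs; return (mean, population standard deviation)."""
    variants = permute_precursors(reaction_smiles, n_augment, seed)
    predictions = [predict_yield(v) for v in variants]
    return statistics.mean(predictions), statistics.pstdev(predictions)


if __name__ == "__main__":
    # Illustrative reaction string, not taken from the data set.
    rxn = "Clc1ccccn1.Cc1ccc(N)cc1.CN(C)C(=NC(C)(C)C)N(C)C>>Cc1ccc(Nc2ccccn2)cc1"

    # Stand-in for a model whose output depends on the input ordering.
    def toy_model(smiles: str) -> float:
        return 50.0 + len(smiles.split(".")[0])

    mean, spread = predict_with_uncertainty(rxn, toy_model, n_augment=10)
    print(f"predicted yield: {mean:.1f} (spread {spread:.1f})")
```

A model that is insensitive to the ordering of its inputs gives a spread of zero under this scheme, so the spread reflects how much the prediction depends on the particular text representation of the reaction.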
