A reagent-driven visual method for analyzing chemical reaction data

24 January 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The increasing use of machine learning and artificial intelligence in chemical reaction studies demands high-quality reaction data, which in turn requires specialized tools for understanding and curating that data. Our work introduces a novel methodology for reaction data exploration centered on reagents: essential molecules in reactions that do not contribute atoms to products. We propose an intuitive tool for creating interactive reagent space maps using distributed vector representations, akin to word2vec in Natural Language Processing, that capture the statistics of reagent usage within datasets. Our approach enables swift assessment of reaction type distributions, identification of alternative reagents, and detection of errors, which we demonstrate using the USPTO dataset. Our contributions include an open-source web application for visual reagent pattern analysis and a table cataloging around six hundred of the most frequent reagents in USPTO, annotated with detailed roles. Accessible via GitHub at https://github.com/Academich/reagent_emb_vis, our method supports organic chemists and cheminformatics experts in navigating extensive reaction datasets efficiently.
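To make the idea of reagent vectors that capture usage statistics concrete, the following is a minimal sketch and not the authors' implementation. It swaps word2vec for a simpler count-based stand-in: positive pointwise mutual information (PPMI) over reagent co-occurrence within reactions, compared by cosine similarity. The toy reactions and the function names `ppmi_vectors`, `cosine`, and `nearest` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each reaction is reduced to its set of reagent names.
reactions = [
    {"Pd(PPh3)4", "K2CO3", "dioxane"},
    {"Pd(PPh3)4", "Na2CO3", "dioxane"},
    {"Pd(dppf)Cl2", "K2CO3", "dioxane"},
    {"NaBH4", "MeOH"},
    {"NaBH4", "EtOH"},
    {"LiAlH4", "THF"},
]


def ppmi_vectors(reactions):
    """Return {reagent: {context_reagent: PPMI}} from co-occurrence counts."""
    single = Counter()  # number of reactions containing each reagent
    pair = Counter()    # number of reactions containing each ordered pair
    for rxn in reactions:
        single.update(rxn)
        for a, b in combinations(sorted(rxn), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    n = len(reactions)
    vectors = {reagent: {} for reagent in single}
    for (a, b), count in pair.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ); keep only positive values.
        pmi = math.log((count * n) / (single[a] * single[b]))
        if pmi > 0:
            vectors[a][b] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(vectors, query, k=3):
    """Return the k reagents whose usage context is most similar to query."""
    scores = [
        (cosine(vectors[query], vec), name)
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, reverse=True)[:k]


vectors = ppmi_vectors(reactions)
for score, name in nearest(vectors, "Na2CO3"):
    print(f"{name}: {score:.3f}")
```

In this toy set, the two carbonate bases never appear in the same reaction, yet they end up close to each other because they share the same co-reagents. That is the kind of signal a reagent map could use to suggest alternative reagents, under the assumptions stated above.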

Keywords

reagents
word2vec
USPTO
chemical space exploration
