Abstract
The increasing use of machine learning and artificial intelligence in chemical reaction studies demands high-quality reaction data, which in turn requires specialized tools for understanding and curating that data. Our work introduces a novel methodology for reaction data exploration centered on reagents, the essential molecules in a reaction that do not contribute atoms to the products. We propose an intuitive tool for creating interactive maps of reagent space using distributed vector representations, akin to word2vec in Natural Language Processing, that capture the statistics of reagent usage within a dataset. Our approach enables swift assessment of reaction type distributions, identification of alternative reagents, and detection of errors, which we demonstrate on the USPTO dataset. Our contributions include an open-source web application for visual analysis of reagent patterns and a table cataloging around six hundred of the most frequent reagents in USPTO, annotated with detailed roles. The application is available on GitHub at https://github.com/Academich/reagent_emb_vis. Our method supports organic chemists and cheminformatics experts in navigating extensive reaction datasets efficiently.
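The idea that reagent usage statistics can reveal alternative reagents can be illustrated with a minimal sketch. It is not the authors' implementation and it does not train word2vec: as a simpler stand-in, it represents each reagent by raw counts of the reagents it co-occurs with and compares those count vectors by cosine similarity. The toy reaction list and the function names (`cooccurrence_vectors`, `cosine`, `most_similar`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations
from math import sqrt

# Hypothetical toy data: the set of reagents recorded for each reaction.
REACTIONS = [
    {"Pd(PPh3)4", "K2CO3", "dioxane"},
    {"Pd(dppf)Cl2", "K2CO3", "dioxane"},
    {"NaBH4", "MeOH"},
    {"LiAlH4", "THF"},
    {"NaBH4", "THF"},
]


def cooccurrence_vectors(reactions):
    """Map each reagent to counts of the reagents it appears alongside."""
    vectors = {}
    for reagents in reactions:
        # Every ordered pair (a, b) adds one count of b to a's context vector.
        for a, b in permutations(reagents, 2):
            vectors.setdefault(a, Counter())[b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, k=3):
    """Return the k reagents whose usage contexts best match the query's."""
    scores = [
        (other, cosine(vectors[query], vec))
        for other, vec in vectors.items()
        if other != query
    ]
    # Highest similarity first; ties are broken alphabetically.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


vectors = cooccurrence_vectors(REACTIONS)
print(most_similar("Pd(PPh3)4", vectors, k=1))
print(most_similar("NaBH4", vectors, k=1))
```

In this toy data the two palladium catalysts share identical contexts and so rank as each other's nearest neighbors, and the two hydride reducing agents are likewise closest to each other. A learned embedding such as word2vec would replace the raw counts with dense vectors, but the neighbor lookup would work in the same way.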
Supplementary weblinks
GitHub repository: the repository to reproduce the application and the data preparation