Abstract
The construction of predictive models in molecular science increasingly relies on large, high-quality datasets. Synthetic data generation is becoming a foundational strategy for advancing model accuracy and enabling fast discovery workflows. To support the development of structure elucidation and spectral property prediction models, we present a comprehensive synthetic dataset of infrared (IR) and nuclear magnetic resonance (NMR) spectra for a diverse ensemble of organic molecules. The data were generated using a hybrid computational approach that integrates molecular dynamics (MD) simulations, density functional theory (DFT) calculations, and machine learning (ML) models. The dataset primarily consists of IR spectra for 177,461 molecules, derived from long-timescale MD simulations with ML-accelerated dipole moment predictions. In addition, it includes a smaller subset of 1H-NMR and 13C-NMR chemical shifts for 1,255 molecules. This unique combination of spectral data offers a valuable resource for benchmarking and validating computational methodologies, developing and enhancing artificial intelligence (AI) models for molecular property prediction, and facilitating the interpretation of experimental spectroscopic results. The dataset is publicly available through Zenodo, encouraging its broad utilization within the scientific community.
Supplementary weblinks
Title
IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules
Description
This dataset supports the ChemRxiv preprint. It includes infrared spectra for 177,461 molecules from MD simulations with ML-predicted dipoles, and 1H-NMR and 13C-NMR shifts for 1,255 molecules computed via DFT. The data are designed to accelerate model development in spectroscopy and structure elucidation, and are publicly available on Zenodo.