IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

Federico Zipoli; Marvin Alberts; Teodoro Laino

doi:10.26434/chemrxiv-2025-l0zqz

Abstract

The construction of predictive models in molecular science increasingly relies on large, high-quality datasets. Synthetic data generation is becoming a foundational strategy for advancing model accuracy and enabling fast discovery workflows. To support the development of structure elucidation and spectral property prediction models, we present a comprehensive synthetic dataset of infrared (IR) and nuclear magnetic resonance (NMR) spectra for a diverse ensemble of organic molecules. The data were generated using a hybrid computational approach that integrates molecular dynamics (MD) simulations, density functional theory (DFT) calculations, and machine learning (ML) models. The dataset primarily consists of IR spectra for 177,461 molecules, derived from long-timescale MD simulations with ML-accelerated dipole moment predictions. In addition, it includes a smaller subset of 1H-NMR and 13C-NMR chemical shifts for 1,255 molecules. This unique combination of spectral data offers a valuable resource for benchmarking and validating computational methodologies, developing and enhancing artificial intelligence (AI) models for molecular property prediction, and facilitating the interpretation of experimental spectroscopic results. The dataset is publicly available through Zenodo, encouraging its broad utilization within the scientific community.

Keywords

Nuclear Magnetic Resonance (NMR)

Infrared Spectroscopy (IR)

Spectroscopy

Molecular Dynamics (MD)

Supplementary weblinks

Title

IR–NMR Multimodal Computational Spectra Dataset for 177K Patent-Extracted Organic Molecules

Description

This dataset supports the ChemRxiv preprint. It includes infrared spectra for 177,461 molecules from MD simulations with ML-predicted dipoles, and 1H-NMR and 13C-NMR chemical shifts for 1,255 molecules computed via DFT. The data are designed to accelerate model development in spectroscopy and structure elucidation, and are publicly available on Zenodo.
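To illustrate how an IR spectrum can be obtained from an MD trajectory of dipole moments, the sketch below uses the standard textbook route: the power spectrum of the dipole time derivative, which by the Wiener-Khinchin theorem equals the Fourier transform of its autocorrelation function. This is not confirmed to be the authors' exact pipeline; the function name `ir_spectrum`, its parameters, the Hann window, and the synthetic test signal are all assumptions made for illustration, and no quantum correction factor or absolute intensity normalisation is applied.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def ir_spectrum(dipoles, dt_fs):
    """Return (wavenumbers in cm^-1, unnormalised intensity).

    Hypothetical helper, not taken from the dataset's code.
    dipoles: array of shape (n_steps, 3), one dipole vector per MD frame.
    dt_fs:   time between consecutive frames, in femtoseconds.
    """
    dipoles = np.asarray(dipoles, dtype=float)
    # Finite-difference time derivative; this also removes the static dipole.
    mu_dot = np.diff(dipoles, axis=0) / dt_fs
    # Hann window to limit spectral leakage from the finite trajectory length.
    window = np.hanning(mu_dot.shape[0])[:, None]
    # Power spectrum of the dipole derivative, summed over x, y and z.
    power = np.sum(np.abs(np.fft.rfft(mu_dot * window, axis=0)) ** 2, axis=1)
    # Convert the FFT frequency axis from Hz to wavenumbers (cm^-1).
    freq_hz = np.fft.rfftfreq(mu_dot.shape[0], d=dt_fs * 1e-15)
    return freq_hz / C_CM_PER_S, power


# Synthetic check (assumed values): a dipole oscillating at 1700 cm^-1,
# sampled every 0.5 fs for 10 ps.
t_fs = np.arange(20001) * 0.5
omega = 2.0 * np.pi * 1700.0 * C_CM_PER_S * 1e-15  # angular frequency in rad/fs
dip = np.zeros((t_fs.size, 3))
dip[:, 0] = 1.0 + 0.1 * np.cos(omega * t_fs)
wavenumbers, intensity = ir_spectrum(dip, dt_fs=0.5)
peak = wavenumbers[np.argmax(intensity)]  # falls within a few cm^-1 of 1700
```

With a 10 ps trajectory the frequency resolution is roughly 3.3 cm^-1, so longer simulations give proportionally sharper spectra.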
