Abstract
Multi-modal contrastive learning offers a powerful framework for molecular structure elucidation from vibrational spectra. VibraCLIP applies this approach by embedding infrared (IR) spectra, Raman spectra, and molecular graphs into a unified latent space, enabling accurate, on-the-fly, and scalable molecular identification. Aligning the IR and Raman modalities boosts Top-1 retrieval accuracy from 12.4% to 62.9%, while incorporating standardized molecular mass elevates Top-25 accuracy to 98.9%, underscoring the value of chemically grounded anchoring features. VibraCLIP’s three-way alignment strategy captures the complementary nature of vibrational modes, offering substantial improvements over single-modality baselines. A lightweight fine-tuning protocol, updating only the final projection layer, enables robust generalization from theoretical to experimental datasets. This flexible, data-efficient framework transforms vibrational spectroscopy into a high-precision tool for molecular discovery. VibraCLIP establishes a new standard for AI-driven spectral interpretation, bridging molecular structure and spectroscopy with broad impact in fields as disparate as drug discovery and astrochemical observation.
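The abstract does not give implementation details, but the contrastive alignment it describes is typically realized with a symmetric InfoNCE objective, as in CLIP. The following is a minimal NumPy sketch of that objective between two modalities (e.g. spectrum embeddings and graph embeddings), plus the cosine-similarity retrieval step used to rank candidate structures. All names, the batch of random embeddings, and the temperature value are illustrative assumptions, not VibraCLIP's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def _log_softmax(x, axis):
    # numerically stable log-softmax
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matched pairs share a row index; the loss pulls them together in
    the shared latent space and pushes apart all mismatched pairs.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                 # (N, N) scaled cosine similarities
    diag = np.arange(len(z_a))
    loss_ab = -_log_softmax(logits, axis=1)[diag, diag].mean()  # spectrum -> graph
    loss_ba = -_log_softmax(logits, axis=0)[diag, diag].mean()  # graph -> spectrum
    return 0.5 * (loss_ab + loss_ba)

# Hypothetical encoder outputs: spectrum embeddings and well-aligned
# graph embeddings (the graph copy gets a small perturbation).
z_spec = rng.normal(size=(8, 64))
z_graph = z_spec + 0.01 * rng.normal(size=(8, 64))

loss = clip_loss(z_spec, z_graph)

# Retrieval: for each spectrum, rank candidate graphs by cosine similarity;
# Top-1 accuracy is the fraction whose best match is the true structure.
sim = (z_spec / np.linalg.norm(z_spec, axis=1, keepdims=True)) @ \
      (z_graph / np.linalg.norm(z_graph, axis=1, keepdims=True)).T
top1 = (sim.argmax(axis=1) == np.arange(len(z_spec))).mean()
```

A three-way variant would simply sum pairwise losses over the IR-Raman, IR-graph, and Raman-graph pairs, and the fine-tuning protocol mentioned above would freeze the encoders and update only the final projection layer producing `z_spec` and `z_graph`.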
Supplementary materials
Title
Supporting Information
Description
The supporting information includes a detailed table of hyperparameter optimization, model implementation details, and visualizations of elucidated structures across different retrieval accuracies and model strategies for the experimental dataset, providing additional insight into the performance and interpretability of the proposed approach.