Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS

10 April 2025, Version 4
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Characterizing biological and environmental samples at a molecular level relies primarily on tandem mass spectrometry (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here, we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset, mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we name Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
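
To make the masked-peak objective mentioned in the abstract more concrete, the following is a minimal sketch of how one training example might be prepared. It is not the DreaMS implementation: the function name `mask_peaks`, the masking fraction, the uniform sampling of peaks, and the sentinel value `-1.0` are all assumptions chosen for illustration only.

```python
import numpy as np

def mask_peaks(mzs, intensities, mask_fraction=0.3, seed=0):
    """Hide the m/z values of a random subset of peaks (illustrative only).

    Returns the masked m/z array, the untouched intensities, the indices of
    the hidden peaks, and the true m/z values a model would be asked to
    recover. All parameter choices here are assumptions, not values taken
    from the preprint.
    """
    rng = np.random.default_rng(seed)
    mzs = np.asarray(mzs, dtype=float)
    intensities = np.asarray(intensities, dtype=float)

    n_peaks = len(mzs)
    n_masked = max(1, int(round(mask_fraction * n_peaks)))

    # Pick distinct peak positions to hide (uniform sampling is an assumption).
    masked_idx = np.sort(rng.choice(n_peaks, size=n_masked, replace=False))

    masked_mzs = mzs.copy()
    masked_mzs[masked_idx] = -1.0  # hypothetical sentinel marking a hidden peak

    targets = mzs[masked_idx]  # values the model would be trained to predict
    return masked_mzs, intensities, masked_idx, targets


if __name__ == "__main__":
    masked, ints, idx, targets = mask_peaks(
        [100.0, 150.5, 200.1, 250.2], [0.1, 1.0, 0.4, 0.2], mask_fraction=0.5
    )
    print("masked m/z:", masked)
    print("hidden positions:", idx, "targets:", targets)
```

In a setup like this, a network would receive the masked spectrum and be penalized for errors on the hidden positions only, so no structural annotation is needed to produce a training signal.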

Keywords

mass spectrometry
metabolomics
machine learning
self-supervised learning
large language models

Supplementary materials

Supplementary information
Supplementary information containing four tables: Tables S1-S2 describe the details of the GeMS dataset, Tables S3-S4 show the hyperparameter search for the pre-training and fine-tuning of DreaMS.
