Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra

23 April 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Tandem mass spectrometry (MS/MS) is the primary method for characterizing biological and environmental samples at the molecular level. Nevertheless, interpreting tandem mass spectra remains a challenge. Existing computational methods for making predictions from mass spectra rely heavily on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our new GeMS (GNPS Experimental Mass Spectra) dataset, mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we name DreaMS (Deep Representations Empowering the Annotation of Mass Spectra). Fine-tuning the pre-trained neural network to predict spectral similarity, molecular fingerprints, chemical properties, and the presence of fluorine from tandem mass spectra yields state-of-the-art performance across all of these tasks. This underscores the practical utility of DreaMS across diverse mass spectrum interpretation tasks and establishes it as a stepping stone for future advances in the field. We make our new dataset and pre-trained models available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
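
To make the masked-peak pre-training objective more concrete, the following is a minimal sketch of how a spectrum could be split into visible peaks and hidden prediction targets. It is an illustration under stated assumptions, not the DreaMS implementation: the function name `mask_peaks`, the representation of a spectrum as a list of `(mz, intensity)` pairs, and the `mask_fraction` parameter are all hypothetical, and the abstract does not specify the masking rate or strategy actually used.

```python
import random


def mask_peaks(peaks, mask_fraction=0.3, seed=None):
    """Split a spectrum into visible peaks and masked target peaks.

    `peaks` is a list of (mz, intensity) pairs. The representation and the
    default `mask_fraction` are assumptions made for illustration only.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be between 0 and 1")

    rng = random.Random(seed)
    # Number of peaks to hide, truncated toward zero.
    n_mask = int(len(peaks) * mask_fraction)
    masked_idx = set(rng.sample(range(len(peaks)), n_mask))

    # A model would see `visible` and be trained to reconstruct `targets`.
    visible = [p for i, p in enumerate(peaks) if i not in masked_idx]
    targets = [p for i, p in enumerate(peaks) if i in masked_idx]
    return visible, targets
```

In a self-supervised setup of this kind, the visible peaks would be fed to the network and the loss computed only on the hidden peaks, so no structural annotation of the spectrum is needed.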

Keywords

mass spectrometry
metabolomics
machine learning
self-supervised learning
large language models
