MASSISTANT: A Deep Learning Model for De Novo Molecular Structure Prediction from EI‑MS Spectra via SELFIES Encoding

19 March 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Gas chromatography coupled with electron impact mass spectrometry (GC‑EI‑MS) is a widely used analytical technique for identifying volatile and semi‑volatile compounds in applications ranging from pharmaceutical research to materials science. However, because not every molecule is included in EI‑MS databases, scientists often have to identify unknown chromatographic peaks solely from their EI‑MS spectra. This manual interpretation is time-consuming and depends heavily on expert knowledge, and it often leads to ambiguous or inconclusive results. In this work, we introduce MASSISTANT, a deep learning model that predicts molecular structures de novo, directly from low‑resolution EI‑MS spectra, using SELFIES encoding. MASSISTANT was trained on compounds with molecular weights below 600 Da, and its performance is sensitive to dataset curation: training on the full NIST dataset (180k spectra) yields approximately 10% exact predictions (Tanimoto score = 1), whereas a more focused, chemically homogeneous subset raises this rate to as high as 54%. These results highlight the capability of deep neural networks to capture complex fragmentation patterns and generate chemically valid structures, offering mass spectrometry scientists a powerful tool for elucidating not only whole molecular structures but also substructures and functional groups in GC‑EI‑MS analyses.
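As an illustration of the exact-prediction criterion quoted above (Tanimoto score = 1), the following minimal Python sketch computes a Tanimoto similarity between two fingerprints and the fraction of predictions that match their reference exactly. It is not the authors' evaluation code: the abstract does not state which molecular fingerprint is used, so fingerprints are represented here simply as sets of "on" bit indices, and the names `tanimoto` and `exact_match_rate` as well as the toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def exact_match_rate(predicted, reference):
    """Fraction of prediction/reference pairs whose Tanimoto score equals 1."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if tanimoto(p, r) == 1.0)
    return hits / len(predicted)


# Toy fingerprints (assumed data, for illustration only).
predicted = [{1, 4, 9}, {2, 3}, {5, 7, 8}, {0, 6}]
reference = [{1, 4, 9}, {2, 3, 11}, {5, 7, 8}, {6, 10}]

rate = exact_match_rate(predicted, reference)
print(f"Exact predictions: {rate:.0%}")
```

Under this set-based definition, a score of 1 means the two fingerprints are identical; whether that implies identical molecules depends on the fingerprint actually used.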

Keywords

Deep Learning
Electron Impact Mass Spectrometry
SELFIES
De Novo Structure Prediction
GC‑MS
Cheminformatics
