Abstract
Identifying molecular structure from spectroscopic readings is a key task in a variety of chemical and biological applications. Common spectroscopy techniques, such as infrared (IR) spectroscopy and mass spectrometry (MS), provide detailed information on the structure of molecular compounds but nonetheless require expert-level knowledge to decode. Machine learning has emerged as a potential solution for automating structure prediction from chemical spectra; however, current approaches generally focus on a single sensor modality and fail to leverage the complementary information contained in different spectra. In this paper, we introduce Peak2Patch, a novel approach to fusion-enhanced prediction of functional groups from IR and mass spectra. First, we perform a detailed comparison of backbone networks for encoding both sparse mass spectra and dense IR spectra, and demonstrate the superior performance of transformer neural networks over current state-of-the-art convolutional neural networks. Second, we evaluate three broad categories of fusion: early (raw feature), middle (deep feature), and late (decision) fusion, demonstrating the potential of a deep-feature fusion approach. Lastly, we present Peak2Patch, our attention-based fusion scheme, which uses cross-attention to mix features between the encoded tokens of the two modalities. We validate our approach on a publicly available multimodal dataset of 790k simulated molecules, demonstrating a large improvement in functional group prediction over both the previous state of the art and our own strong single-modal baselines.