Analytical Chemistry

Predicting RP-LC retention indices of structurally unknown chemicals from mass spectrometry data

Authors

Abstract

Non-target analysis combined with high-resolution mass spectrometry is considered one of the most comprehensive strategies for the detection and identification of known and unknown chemicals in complex samples. However, many compounds remain unidentified due to data complexity and the limited number of structures in chemical databases. In this work, we have developed and validated a novel machine learning algorithm to predict the retention index (r$_i$) values of structurally (un)known chemicals based on their measured fragmentation pattern. The developed model enabled, for the first time, the prediction of r$_i$ values without the need for the exact structure of the chemicals, with an $R^2$ of 0.91 and 0.77 and a root mean squared error (RMSE) of 47 and 67 r$_i$ units for the Norman and amide test sets, respectively. This fragment-based model showed accuracy in r$_i$ prediction comparable to that of conventional descriptor-based models that rely on known chemical structure, which obtained an $R^2$ of 0.85 and an RMSE of 67.
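For readers unfamiliar with the two accuracy metrics quoted above, the following minimal sketch shows how RMSE and $R^2$ are commonly computed from observed and predicted r$_i$ values. It assumes the usual coefficient-of-determination definition of $R^2$; the paper may define it differently (for example, as a squared correlation coefficient). The function names and the numbers are hypothetical illustrations, not the authors' code or data.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs (here r_i units)."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical r_i values for illustration only; not taken from the paper.
observed = [300.0, 450.0, 600.0, 750.0]
predicted = [320.0, 430.0, 610.0, 740.0]

print(f"RMSE = {rmse(observed, predicted):.1f} r_i units")
print(f"R^2  = {r_squared(observed, predicted):.3f}")
```

Because RMSE is expressed in r$_i$ units, it can be read directly as a typical prediction error, whereas $R^2$ depends on the spread of r$_i$ values in each test set.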

Content

RI_paper.pdf

Supplementary material

RI_paper_supplementary.pdf
Supplementary materials
Supplementary information for the main paper: Predicting RP-LC retention indices of structurally unknown chemicals from mass spectrometry data

Supplementary weblinks

GitHub repository containing models and example code