Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows

29 January 2025, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

In HRMS-based non-targeted analysis (NTA), when no retention information is available for unknown compounds, spectral matching is one of the most widely used approaches for assessing chemical identification probability (IP). Recently, the metabolomics community has proposed true positive (TP) probability as an alternative to conventional confidence assessment approaches. In this study, we combined information extracted from MS/MS spectra with calibrant-free predicted retention time indices (RTIs) to estimate the TP probability of each chemical annotation by integrating three machine learning (ML) models. The first is a molecular fingerprint (MF)-to-RTI model trained on 4,713 calibrants. The second is a cumulative neutral loss (CNL)-to-RTI model trained on 485,577 experimental spectra. The third is a binary classification model trained on 1,686,319 TP and true negative (TN) annotations. Our results demonstrated a high correlation (training: R² = 0.96; testing: R² = 0.88) between MF-derived and CNL-derived RTI values, suggesting reduced RTI error for TP annotations. By incorporating the output parameters of a previously developed library search algorithm, the monoisotopic mass, and the RTI error for TP determination, the k-nearest neighbors algorithm achieved a weighted F1 score of 0.65 and a Matthews correlation coefficient of 0.30 for annotations whose spectral matching scores were ≥50% of the total score. The resulting ML models were applied to RPLC/HRMS NTA of pesticide mixtures spiked into solvent blank and into 100× and 10× diluted black tea matrix. The chemical IPs of TP candidates increased by 54.5%, 52.1%, and 46.7%, respectively. This work demonstrates the application of large-scale ML model training to enhance the chemical IP of unknown compounds.
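For readers less familiar with the two classification metrics quoted above, the following is a minimal sketch of how a weighted F1 score and a Matthews correlation coefficient can be computed for binary TP/TN annotation labels. It is not the authors' code: the function names and the toy labels are hypothetical, and the label coding (1 = TP annotation, 0 = TN annotation) is an assumption made for illustration.

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels.

    Assumed coding (not from the preprint): 1 = TP annotation, 0 = TN annotation.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    f1_pos = f1_from_counts(tp, fp, fn)
    # For the negative class the roles of the counts are swapped.
    f1_neg = f1_from_counts(tn, fn, fp)
    n_pos, n_neg = tp + fn, tn + fp
    return (n_pos * f1_pos + n_neg * f1_neg) / (n_pos + n_neg)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, purely illustrative (not data from the study).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_weighted_f1 = weighted_f1(toy_true, toy_pred)
toy_mcc = mcc(toy_true, toy_pred)
```

The weighted F1 score averages per-class F1 by class support, so it can look acceptable on imbalanced data, while the Matthews correlation coefficient uses all four confusion-matrix cells and is generally the stricter of the two; reporting both gives a fuller picture than either alone.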

Keywords

Non-targeted screening
Identification confidence
Quantitative structure-retention relationship (QSRR)
MS/MS spectral reference library
Supervised learning
Model transferability

Supplementary materials

Supplementary Information for Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows
This file includes:
- Additional Text
- Technical Details of Model 1 Development
- Technical Details of Model 2 Development
- Technical Details of Model 3 Development
- Computation and Code Availability for Models Development
- References of the Additional Text
- Additional Figures S01–15
- Additional Tables S01–04
