Abstract
In HRMS-based non-targeted analysis (NTA) where no retention information is available for unknown compounds, spectral matching is one of the most widely used approaches for assessing chemical identification probability (IP). Recently, the metabolomics community has proposed true positive (TP) probability as an alternative to conventional confidence assessment approaches. In this study, information extracted from MS/MS spectra was combined with calibrant-free predicted retention time indices (RTIs) to yield the TP probability of each chemical annotation by integrating three machine learning (ML) models. First, a molecular fingerprint (MF)-to-RTI model was trained on 4,713 calibrants. Second, a cumulative neutral loss (CNL)-to-RTI model was trained on 485,577 experimental spectra. Finally, a binary classification model was trained on 1,686,319 TP and true negative (TN) annotations. Our results demonstrated a high correlation (training: R² = 0.96; testing: R² = 0.88) between MF-derived and CNL-derived RTI values, suggesting reduced RTI error for TP annotations. Using the output parameters of a previously developed library search algorithm, the monoisotopic mass, and the RTI error as inputs for TP determination, the k-nearest neighbors algorithm achieved a weighted F1 score of 0.65 and a Matthews correlation coefficient of 0.30 for annotations whose spectral matching scores were ≥50% of the total score. The resulting ML models were applied to RPLC/HRMS NTA of pesticide mixtures spiked into a solvent blank and into 100× and 10× diluted black tea matrix. The chemical IPs of TP candidates increased by 54.5%, 52.1%, and 46.7%, respectively. This work demonstrates how ML models trained at large scale can enhance the chemical IP of unknown compounds.
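The two classification metrics reported above, the weighted F1 score and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy labels are hypothetical and chosen only for illustration, with 1 standing for a TP annotation and 0 for a TN annotation.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count tp, tn, fp, fn for binary labels (1 = TP annotation, 0 = TN annotation)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1(hits, false_alarms, misses):
    """F1 score of one class, computed from its own hit, false-alarm, and miss counts."""
    denom = 2 * hits + false_alarms + misses
    return 0.0 if denom == 0 else 2 * hits / denom


def weighted_f1(tp, tn, fp, fn):
    """Per-class F1 averaged with weights equal to each class's support."""
    f1_pos = f1(tp, fp, fn)   # class 1: misses are fn, false alarms are fp
    f1_neg = f1(tn, fn, fp)   # class 0: roles of fp and fn are swapped
    n_pos, n_neg = tp + fn, tn + fp
    return (f1_pos * n_pos + f1_neg * n_neg) / (n_pos + n_neg)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from the confusion matrix; returns 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels (hypothetical, not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(round(weighted_f1(tp, tn, fp, fn), 3))
print(round(matthews_corrcoef(tp, tn, fp, fn), 3))
```

Because the weighted F1 score averages per-class scores by class size while MCC uses all four confusion-matrix cells, the two can diverge on imbalanced annotation sets, which is one reason to report both.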
Supplementary materials
Title
Supplementary Information for Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows
Description
This file includes:
- Additional Text
  - Technical Details of Model 1 Development
  - Technical Details of Model 2 Development
  - Technical Details of Model 3 Development
  - Computation and Code Availability for Models Development
- References of the Additional Text
- Additional Figures S01–15
- Additional Tables S01–04
Supplementary weblinks
Title
Available Code for Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows
Description
Stores all scripts (with step-by-step instructions) for model application and identification probability determination
Title
Available Models for Integration of Transferable Prediction of Retention Index and Universal Library Search Enhances Exposome Identification Probability in RPLC/HRMS-Based Non-Targeted Analysis
Description
Stores all models developed in this work