Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows

Hiu Lok NGAN; Viktoriia Turkina; Denice van Herwerden; Hong Yan; Zongwei Cai; Saer Samanipour

doi:10.26434/chemrxiv-2024-mdl4q-v2

Analytical Chemistry


29 January 2025, Version 2

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

In non-targeted analysis (NTA) based on high-resolution mass spectrometry (HRMS), when no retention information is available for the unknown compounds, spectral matching is one of the most widely used approaches for assessing chemical identification probability (IP). Recently, the metabolomics community has proposed true positive (TP) probability as an alternative to conventional confidence assessment approaches. In this study, information extracted from MS/MS spectra was combined with calibrant-free predicted retention time indices (RTIs) to yield a TP probability for each chemical annotation by integrating three machine learning (ML) models. The first is a molecular fingerprint (MF)-to-RTI model trained on 4,713 calibrants. The second is a cumulative neutral loss (CNL)-to-RTI model trained on 485,577 experimental spectra. The third is a binary classification model trained on 1,686,319 TP and true negative (TN) annotations. Our results demonstrated a high correlation (training: R² = 0.96; testing: R² = 0.88) between MF-derived and CNL-derived RTI values, suggesting reduced RTI error for TP annotations. Using the output parameters of a previously developed library search algorithm, the monoisotopic mass, and the RTI error as inputs for TP determination, the k-nearest neighbors algorithm achieved a weighted F1 score of 0.65 and a Matthews correlation coefficient of 0.30 for annotations with spectral matching scores ≥50% of the total score. The resulting ML models were applied to RPLC/HRMS NTA of pesticide mixtures spiked into solvent blank and into 100× and 10× diluted black tea matrix. The chemical IPs of TP candidates increased by 54.5%, 52.1%, and 46.7%, respectively. This work demonstrates the application of ML with large-scale model training to enhance the chemical IP of unknown compounds.
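The two classification metrics quoted in the abstract, the weighted F1 score and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch. This is not the authors' code: the function names, the label encoding (1 for a TP annotation, 0 for a TN annotation), and the toy labels are all assumptions made for illustration, and only the standard textbook definitions of the two metrics are used.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count confusion-matrix cells for binary labels.

    Assumed encoding (hypothetical, not from the paper):
    1 = correct annotation (the paper's "TP"), 0 = incorrect annotation ("TN").
    Returns (hit_1, hit_0, false_1, false_0).
    """
    pairs = list(zip(y_true, y_pred))
    hit_1 = sum(1 for t, p in pairs if t == 1 and p == 1)    # class 1 predicted as 1
    hit_0 = sum(1 for t, p in pairs if t == 0 and p == 0)    # class 0 predicted as 0
    false_1 = sum(1 for t, p in pairs if t == 0 and p == 1)  # class 0 predicted as 1
    false_0 = sum(1 for t, p in pairs if t == 1 and p == 0)  # class 1 predicted as 0
    return hit_1, hit_0, false_1, false_0


def _f1(hits, false_alarms, misses):
    """F1 for one class: 2*hits / (2*hits + false alarms + misses)."""
    denom = 2 * hits + false_alarms + misses
    return 2 * hits / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    hit_1, hit_0, false_1, false_0 = confusion_counts(y_true, y_pred)
    f1_class1 = _f1(hit_1, false_1, false_0)
    f1_class0 = _f1(hit_0, false_0, false_1)
    n_class1 = hit_1 + false_0  # number of true class-1 samples
    n_class0 = hit_0 + false_1  # number of true class-0 samples
    return (n_class1 * f1_class1 + n_class0 * f1_class0) / (n_class1 + n_class0)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    hit_1, hit_0, false_1, false_0 = confusion_counts(y_true, y_pred)
    denom = sqrt(
        (hit_1 + false_1) * (hit_1 + false_0)
        * (hit_0 + false_1) * (hit_0 + false_0)
    )
    return (hit_1 * hit_0 - false_1 * false_0) / denom if denom else 0.0


# Toy labels, invented purely for illustration (not data from the study).
example_true = [1, 1, 1, 0, 0, 0, 0, 0]
example_pred = [1, 1, 0, 1, 0, 0, 0, 0]
```

For the toy labels above, `weighted_f1(example_true, example_pred)` evaluates to 0.75 and `mcc(example_true, example_pred)` to 7/15 (about 0.47). Weighting the per-class F1 by class size keeps the larger class from being ignored, while MCC uses all four confusion-matrix cells and stays near zero for a classifier that does no better than chance, which is why the two are often reported together for imbalanced annotation sets.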

Keywords

Non-targeted screening

Identification confidence

Quantitative structure-retention relationship (QSRR)

MS/MS spectral reference library

Supervised learning

Model transferability

Supplementary materials

Title: Supplementary Information for Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows

Description: This file includes:

- Additional Text
- Technical Details of Model 1 Development
- Technical Details of Model 2 Development
- Technical Details of Model 3 Development
- Computation and Code Availability for Models Development
- References of the Additional Text
- Additional Figures S01–15
- Additional Tables S01–04

Supplementary weblinks

Title: Available Code for Machine Learning for Enhanced Identification in RPLC/HRMS Non-Targeted Workflows

Description: Contains all the scripts (with step-by-step instructions) for applying the models and determining identification probability.

Title: Available Models for Integration of Transferable Prediction of Retention Index and Universal Library Search Enhances Exposome Identification Probability in RPLC/HRMS-Based Non-Targeted Analysis

Description: Contains all the models developed in this work.


Version History

Jan 29, 2025 Version 2

Oct 15, 2024 Version 1

Version Notes

- Increased readability (e.g., shortened sentences and simplified Fig. 1)
- Added a description of the training materials and output of Model 1
- Moved the technical explanation of the models’ development into the S.I.

Metrics

Views: 1,664

Downloads: 449

License

The content is available under CC BY NC ND 4.0


Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
