Missing data is a significant issue in metabolomics that is often neglected when conducting data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatog-raphy (GC×GC) data sets. We also present these goals in the context of experimental replication whereby imputation is conducted in a within-replicate-based fashion—the first description and evaluation of this strategy—and introduce an R package MetabImpute to carry out these analyses. Our results conclude that, in these two data sets, missingness was most likely of the missing at-random (MAR) and missing not-at-random (MNAR) types as opposed to missing completely at-random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approach-es (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
Expanded discussion (added limitations and other perspectives on methodology and approaches).
Davis TJ et al. Supplementary Information