ChemRxiv
These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information. For more information, please see our FAQs.

Efficient PCA-Exploration of High-Dimensional Datasets

preprint
submitted on 07.12.2020, 08:57 and posted on 10.12.2020, 08:12 by Oxana Ye. Rodionova, Sergey Kucheryavskiy, Alexey L. Pomerantsev

Basic tools for exploring and interpreting the results of Principal Component Analysis (PCA) are well known and thoroughly described in many comprehensive tutorials. Over the past decade, however, several new tools have been developed. Some of them were originally created for authentication and classification tasks. In this paper we demonstrate that they can also be useful for exploratory data analysis.


We discuss several important aspects of the PCA exploration of high-dimensional datasets, such as estimating the proper complexity of a PCA model, dependence on the data structure, and the presence of outliers. We introduce new tools for assessing PCA model complexity, such as plots of the degrees of freedom developed for the orthogonal and score distances, as well as the Extreme and Distance plots, which offer a new view of the features of the training and test (new) data. These tools are simple and fast to compute. In some cases, they are more efficient than the conventional PCA tools. A simulated example provides an intuitive illustration of their application. Three real-world examples from various fields demonstrate the capabilities of the new tools and the ways they can be used. The first example considers the reproducibility of a handheld spectrometer, using a dataset that is presented for the first time. The other two datasets, which describe the authentication of olives in brine and the classification of wines by geographical origin, are already known and are often used for illustrative purposes.
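As a rough illustration of the quantities named above, the following Python sketch computes score and orthogonal distances for a PCA model and a simple degrees-of-freedom estimate for each. It is a sketch under stated assumptions, not the authors' implementation: the function names `pca_distances` and `moment_dof` are hypothetical, the score distance is scaled by the sample variance of each score (other scaling conventions exist), and the degrees of freedom are estimated by the method of moments assuming each distance follows a scaled chi-square distribution.

```python
import numpy as np


def pca_distances(X, n_components):
    """Return score distances h and orthogonal distances q for each row of X.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    Xc = X - X.mean(axis=0)                      # mean-centre the columns
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings (variables x components)
    T = Xc @ P                                   # scores (objects x components)
    # Variance of each score vector; assumed scaling convention.
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    h = np.sum(T ** 2 / score_var, axis=1)       # score distance per object
    E = Xc - T @ P.T                             # residuals left out of the model
    q = np.sum(E ** 2, axis=1)                   # orthogonal distance per object
    return h, q


def moment_dof(d):
    """Method-of-moments degrees of freedom, assuming d ~ (d0 / N) * chi2(N).

    Under that assumption mean(d) = d0 and var(d) = 2 * d0**2 / N,
    so N = 2 * mean(d)**2 / var(d).
    """
    d0 = d.mean()
    return 2.0 * d0 ** 2 / d.var(ddof=1)


# Usage on simulated data: track the estimates as components are added.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
for k in range(1, 6):
    h_k, q_k = pca_distances(X_demo, k)
    print(k, round(moment_dof(h_k), 2), round(moment_dof(q_k), 2))
```

Plotting the two estimates against the number of components gives one simple version of a degrees-of-freedom plot. Because the estimate depends only on the ratio of the squared mean to the variance, it does not change when the distances are rescaled.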


The paper does not cover well-known topics, such as the algorithms for PCA decomposition or the interpretation of scores and loadings. Instead, we focus primarily on more advanced topics, such as exploring data homogeneity and understanding and evaluating the optimal model complexity. The examples are accompanied by links to free software that implements the tools.

Funding

Russian state assignment No AAAA-A18-118020690203-8

Email Address of Submitting Author

svkucheryavski@gmail.com

Institution

Aalborg University

Country

Denmark

ORCID For Submitting Author

0000-0002-3145-7244

Declaration of Conflict of Interest

No conflict of interest

Version Notes

v.1.0 (pre-print)
