Exploring Feature Engineering for Crystal Structure Classification: Interactive Applications of PCA and PLS-DA Clustering

11 April 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Active learning through interactive exploration significantly enhances student engagement and understanding in chemical education. This educational activity leverages Principal Component Analysis (PCA) and Partial Least Squares Discriminant Analysis (PLS-DA), two foundational machine learning techniques widely applied in contemporary research. Interactive Python-based Jupyter notebooks offer an accessible platform for students exploring chemical data and require no prior programming experience. These notebooks allow learners to engage directly in feature exploration and dimensionality reduction, applied to clustering and classifying binary equiatomic AB solid-state compounds. Students can select and modify chemical and physical features and observe in real time how these choices affect the effectiveness of PCA and PLS-DA clustering models. Initially, PCA enables unsupervised visualization of natural clustering and correlations among compounds without prior labeling. Subsequently, using PLS-DA, students develop supervised models capable of predicting crystal structures, which explicitly illustrates the difference between the supervised and unsupervised learning paradigms. The proposed activity highlights the importance of explainability in machine learning models, as opposed to operating them as a "black box". Beyond teaching fundamental concepts, the activity encourages students to take part in genuine exploratory processes that mirror the investigative approaches historically used by researchers and still practiced today. By experimenting freely with datasets and computational methods, students experience firsthand the iterative nature of scientific discovery, fostering deeper insight into both chemical informatics and broader research methodology.
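The contrast between the unsupervised and supervised steps described above can be illustrated with a minimal sketch. It is not taken from the notebooks: the data are synthetic, the three features and two classes are hypothetical stand-ins for real compound descriptors and structure types, and all function names are invented for illustration. PCA is reduced to its first component (found by power iteration), and PLS-DA is reduced to a single latent variable followed by a 0.5 threshold, which is a simplification of the full multi-component method.

```python
import math
import random


def make_toy_data(n_per_class=30, seed=0):
    """Synthetic stand-in for a compound table: 3 made-up features, 2 made-up classes."""
    rng = random.Random(seed)
    # Features 0 and 1 differ between classes; feature 2 is pure noise.
    centers = {0: (0.0, 0.0, 0.0), 1: (3.0, 2.0, 0.0)}
    X, y = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            X.append([rng.gauss(c, 1.0) for c in center])
            y.append(label)
    return X, y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def standardize(X):
    """Center every feature and scale it to unit sample variance."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def pca_first_component(Z, n_iter=200):
    """Unsupervised: leading eigenvector of Z^T Z by power iteration (labels unused)."""
    w = normalize([1.0] * len(Z[0]))
    for _ in range(n_iter):
        scores = [dot(row, w) for row in Z]
        w = normalize([sum(s * row[j] for s, row in zip(scores, Z))
                       for j in range(len(w))])
    return w


def plsda_first_component(Z, y):
    """Supervised: weights proportional to Z^T (y - mean(y)), the first PLS direction."""
    y_mean = sum(y) / len(y)
    return normalize([sum((yi - y_mean) * row[j] for yi, row in zip(y, Z))
                      for j in range(len(Z[0]))])


X, y = make_toy_data()
Z = standardize(X)

w_pca = pca_first_component(Z)        # direction of maximum variance
w_pls = plsda_first_component(Z, y)   # direction of maximum covariance with the label

# One-component PLS-DA: regress the 0/1 label on the latent score, then threshold at 0.5.
pls_scores = [dot(row, w_pls) for row in Z]
y_mean = sum(y) / len(y)
slope = sum(t * (yi - y_mean) for t, yi in zip(pls_scores, y)) / dot(pls_scores, pls_scores)
predicted = [1 if y_mean + slope * t > 0.5 else 0 for t in pls_scores]
accuracy = sum(p == yi for p, yi in zip(predicted, y)) / len(y)

print("PCA loadings:   ", [round(v, 3) for v in w_pca])
print("PLS-DA weights: ", [round(v, 3) for v in w_pls])
print("Training accuracy of one-component PLS-DA:", round(accuracy, 3))
```

Comparing the two weight vectors shows which features each method relies on, which is the kind of explainability the activity emphasizes: changing or dropping a feature in `make_toy_data` changes the loadings and the accuracy. In practice a library such as scikit-learn (`sklearn.decomposition.PCA`, `sklearn.cross_decomposition.PLSRegression`) would typically replace the hand-written routines; whether the notebooks do so is not stated here.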

Keywords

First-Year Undergraduate
Chemoinformatics
Inorganic Chemistry
Computer-Based Learning
Materials Science
Solid State Chemistry
Machine Learning
