Abstract
High-dimensional data arrays in molecular representations pose significant challenges for machine learning applications, including overfitting and computational inefficiency during training. This is particularly relevant for the Morgan Fingerprint (MFP), which is widely used for tasks such as classification and property prediction, when dealing with large and complex molecular datasets. This article introduces the embedded Morgan Fingerprint (eMFP), a method that reduces the dimensionality of MFP while preserving the key structural information of the encoded molecule. eMFP offers an improved data representation that mitigates the risk of overfitting while enhancing model performance. Our results demonstrate that eMFP outperforms standard MFP in regression models, including Random Forest (RF), Multi-layer Perceptron (MLP), K-Neighbors Regressor (KNR), Gradient Boosting Regressor (GBR) and a Deep Neural Network (DNN), across three different databases (RedDB, NFA, and QM9), with optimal compression sizes of q = 16, q = 32 and q = 64. These findings highlight the potential of eMFP as a faster and often superior alternative to MFP, offering more efficient hyperparameter optimization in regression models for molecular property prediction and improved performance on large datasets that encode high-dimensional categorical data.
Supplementary materials
Title
ESI: Embedded Morgan Fingerprints for more efficient molecular property predictions with machine learning
Description
Supplementary material