Abstract
The human exposome comprises a vast number of chemicals whose fate and behavior remain largely unexplored. Modeling approaches are commonly employed to address this challenge, but there is a recognized need for alternative molecular representations, such as molecular fingerprints. Existing algorithms for computing molecular fingerprints, however, may include irrelevant information or omit information needed for accurate activity prediction. In this study, we present an algorithm for optimizing molecular fingerprints. The algorithm combines the relevant bits of information from several fingerprints, aiming to enrich the final fingerprint for predicting a specific property. To achieve this, the variables (i.e., bits) relevant for prediction were collected from six non-hashed fingerprints and fused into a master fingerprint. We used fish toxicity as a proof of concept. A random forest regression (RFR) model was developed based on the master fingerprint. It performed comparably to conventional descriptor-based models, with $R^2 \approx 0.9$ for the training set and $R^2 \approx 0.6$ for the test set. Molecular fingerprints have the advantage of being consistent and interpretable, which allowed us to confirm the relevance of the selected variables to the toxicity prediction. The final model outperformed each of the models based on individual fingerprints in the number of chemicals whose prediction error fell within $\pm$ one standard deviation of the residuals. The number of such low-error cases was on average four times higher for the master fingerprint-based model. The algorithm developed for optimizing molecular fingerprints is universal and can be applied to various case studies.
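The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the relevance criterion (absolute Pearson correlation between a bit and the target), the threshold value, the function names, and the toy data are all assumptions made for illustration, since the abstract does not state how relevant bits were identified. The helper `count_within_one_sd` likewise assumes that the $\pm$ one standard deviation band is centered on zero.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant input."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def select_relevant_bits(fp_matrix, y, threshold):
    """Indices of bits whose |correlation| with y reaches the threshold.

    Assumption: correlation stands in for whatever relevance measure
    the study actually used.
    """
    keep = []
    for j in range(len(fp_matrix[0])):
        column = [row[j] for row in fp_matrix]
        if abs(pearson(column, y)) >= threshold:
            keep.append(j)
    return keep


def build_master_fingerprint(fingerprints, y, threshold=0.8):
    """Fuse the relevant bits of several fingerprints into one matrix.

    fingerprints: dict mapping a fingerprint name to a list of bit rows,
    one row per chemical. Returns the fused matrix and, for each fused
    column, a (fingerprint name, original bit index) label.
    """
    master = [[] for _ in y]
    labels = []
    for name, matrix in fingerprints.items():
        for j in select_relevant_bits(matrix, y, threshold):
            labels.append((name, j))
            for i, row in enumerate(matrix):
                master[i].append(row[j])
    return master, labels


def count_within_one_sd(y_true, y_pred):
    """Number of predictions whose residual lies within +/- one SD.

    Assumption: the band is centered on zero, not on the mean residual.
    """
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    sd = pstdev(residuals)
    return sum(1 for r in residuals if abs(r) <= sd)


# Toy data (hypothetical): two small fingerprints for four chemicals.
fingerprints = {
    "fp_a": [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]],
    "fp_b": [[0, 1], [0, 1], [1, 1], [1, 0]],
}
toxicity = [3.0, 3.2, 1.0, 1.1]

master, labels = build_master_fingerprint(fingerprints, toxicity)
print(labels)
print(master)
```

Keeping a label for every fused column preserves the link between each bit of the master fingerprint and its source fingerprint, which is what makes the fused representation interpretable. The fused matrix could then be passed to any regression model.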
Supplementary materials
Title
Supporting information for: Molecular Fingerprints Optimization for Enhanced Predictive Modeling
Description
The Supporting Information contains Figures S1-S8, which show the predictions and residuals of each individual model, the absence of correlation between the number of bits in an individual fingerprint and the accuracy of the model based on it, and the contribution of each individual fingerprint to the final optimized fingerprint.