Molecular Fingerprints Optimization for Enhanced Predictive Modeling

26 February 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The human exposome comprises a vast number of chemicals whose fate and behavior remain largely unexplored. Modeling approaches are commonly employed to address this challenge, but there is a recognized need for alternative molecular representations, such as molecular fingerprints. Existing algorithms for computing molecular fingerprints, however, may incorporate information that is irrelevant or insufficient for accurate activity prediction. In this study, we present an algorithm designed to optimize molecular fingerprints. The algorithm combines the relevant bits of information, aiming to enrich the final fingerprint for predicting specific behavioral properties. To achieve this, the variables (i.e., bits) relevant for prediction were collected from six non-hashed fingerprints and fused into a master fingerprint. We used fish toxicity as a proof of concept. A random forest regression (RFR) model was developed based on the master fingerprint. It gave results comparable to those of conventional descriptor-based models, with $R^2 \approx 0.9$ for the training set and $R^2 \approx 0.6$ for the test set. Molecular fingerprints have the advantage of being consistent and interpretable; consequently, we were able to confirm the relevance of the selected variables to toxicity prediction. The final model outperformed each of the models based on individual fingerprints in the number of chemicals whose prediction error fell within +/- one standard deviation of the residuals. The number of such low-error cases was on average four times higher for the master fingerprint-based model. The algorithm developed for optimizing molecular fingerprints is universal and can be applied to various case studies.
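The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the relevance criterion used here (absolute Pearson correlation with the target, with a hypothetical threshold of 0.5), the function names, and the toy data are all assumptions made for illustration, and the abstract does not state how bit relevance was actually determined. The second helper counts predictions whose residual lies within one standard deviation of the residuals, assuming a band centered on zero.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy) ** 0.5


def select_relevant_bits(fingerprint, target, threshold):
    """Indices of bits whose |correlation| with the target meets the threshold.

    ASSUMPTION: correlation is a stand-in relevance criterion, not the
    criterion used in the preprint.
    """
    keep = []
    for j in range(len(fingerprint[0])):
        column = [row[j] for row in fingerprint]
        if abs(pearson(column, target)) >= threshold:
            keep.append(j)
    return keep


def fuse_fingerprints(fingerprints, target, threshold=0.5):
    """Concatenate the relevant bits of several fingerprints.

    fingerprints: dict mapping a fingerprint name to a list of bit rows
    (one row per chemical). Returns (master_rows, labels), where labels
    records the (fingerprint name, bit index) origin of each master column.
    """
    master = [[] for _ in target]
    labels = []
    for name, fp in fingerprints.items():
        for j in select_relevant_bits(fp, target, threshold):
            labels.append((name, j))
            for i, row in enumerate(fp):
                master[i].append(row[j])
    return master, labels


def count_within_one_sd(observed, predicted):
    """Number of residuals with |residual| <= population SD of the residuals."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    sd = pstdev(residuals)
    return sum(1 for r in residuals if abs(r) <= sd)


# Toy data (hypothetical): four chemicals, two small non-hashed fingerprints.
toxicity = [1.0, 2.0, 3.0, 4.0]
toy_fingerprints = {
    "fp_a": [[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]],
    "fp_b": [[1, 0], [1, 0], [1, 0], [1, 1]],
}
master, labels = fuse_fingerprints(toy_fingerprints, toxicity)
print(labels)
print(master)
```

Because the fingerprints are non-hashed, each retained column keeps its origin label, so a model trained on the master fingerprint (for example, a random forest regressor) can be traced back to specific bits of specific fingerprints.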

Keywords

QSAR
Molecular Representation
Molecular Fingerprint
Machine Learning
Optimization

Supplementary materials

Title: Supporting information for: Molecular Fingerprints Optimization for Enhanced Predictive Modeling
Description: The Supporting Information (Figures S1-S8) shows the predictions and residuals of each individual model, the absence of correlation between the number of bits in an individual fingerprint and the accuracy of the model based on it, and the contribution of each individual fingerprint to the final optimized fingerprint.
