Evaluating molecular similarity measures: Do similarity measures reflect electronic structure properties?

30 January 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The rapid adoption of big data, machine learning (ML), and generative artificial intelligence (AI) in chemical discovery has heightened the importance of quantifying molecular similarity. Molecular similarity, commonly assessed as the distance between molecular fingerprints, is integral to applications such as database curation, diversity analysis, and property prediction. AI tools frequently rely on these similarity measures to cluster molecules under the assumption that structurally similar molecules exhibit similar properties. However, this assumption is not universally valid, particularly for continuous properties like electronic structure properties. Despite the prevalence of fingerprint-based similarity measures, their evaluation has largely depended on biological activity datasets and qualitative metrics, limiting their relevance for non-biological domains. To address this gap, we propose a framework to evaluate the correlation between molecular similarity measures and molecular properties. Our approach builds on the concept of neighborhood behavior and incorporates kernel density estimation (KDE) analysis to quantify how well similarity measures capture property relationships. Using a dataset of over 350 million molecule pairs with electronic structure, redox, and optical properties, we systematically evaluate the correlation between several molecular fingerprint generators, distance functions, and these properties. Both the curated dataset and the evaluation framework are publicly available.
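The central quantities in the abstract, a fingerprint-based similarity and its relationship to a property difference, can be illustrated with a minimal sketch. This is not the authors' framework: the fingerprints, molecule names, and property values below are hypothetical toy data, the Tanimoto coefficient is only one of many possible similarity measures, and the KDE analysis is not reproduced here.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical toy data: (on-bit indices of a fingerprint, made-up property value)
molecules = {
    "mol_a": ({1, 4, 7, 9}, 2.10),
    "mol_b": ({1, 4, 7, 8}, 2.25),
    "mol_c": ({2, 3, 5, 6}, 4.80),
}


def pairwise(mols: dict) -> list:
    """Return (name1, name2, similarity, absolute property difference) per pair."""
    rows = []
    for (n1, (fp1, p1)), (n2, (fp2, p2)) in combinations(mols.items(), 2):
        rows.append((n1, n2, tanimoto(fp1, fp2), abs(p1 - p2)))
    return rows


pairs = pairwise(molecules)
for n1, n2, sim, dprop in pairs:
    print(f"{n1}-{n2}: similarity={sim:.2f}, |property difference|={dprop:.2f}")
```

Under the neighborhood-behavior assumption, pairs with high similarity should show small property differences; an evaluation of the kind described above asks how reliably that holds across many pairs.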

Keywords

Molecular Similarity

Supplementary materials

Title: Supplementary Information
Description: Supplementary Information

Comments

Comment number 1, Karoly Heberger: Feb 03, 2025, 11:03

While I highly appreciate the work done (especially the high number of molecules included), I find several points missing:
i) To my knowledge, the first similarity measure based on electron density was created by Carbó et al. (DOI: 10.1002/qua.560170612). It may also be worth mentioning DOI: 10.1016/s0065-3276(08)60021-0.
ii) Our multicriteria comparison of fingerprints and cosine similarity (Fig. 13 in https://www.researchgate.net/publication/315513438) shows the subordinate role of cosine similarity. A discussion would be warranted.
iii) Instead of the frequently cited pairwise comparison options, it is expedient to make multiple (n-ary) comparisons, which are computationally faster and superior in diversity picking; see, e.g., DOI: 10.1186/s13321-021-00505-3 and DOI: 10.1186/s13321-021-00504-4.
iv) It would be interesting to see whether better alternatives to “top area ratios” exist.
Best regards, Karoly

Response, Chad Risko: Feb 03, 2025, 20:49

Dr. Heberger - Thank you for your suggestions! Let us dig through these suggested references and come back with a response. There is a lot of excellent work in this space, and we want to be certain to appropriately account for it. Sincerely, Chad