From chemical similarity measures to an unconventional modeling framework: The application of c-RASAR along with dimensionality reduction techniques in a representative hepatotoxicity dataset

22 July 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

With the rapid progress in cheminformatics, conventional modeling approaches have so far relied on supervised and unsupervised machine learning (ML) and deep learning models that use standard molecular descriptors representing the structural, physicochemical, and electronic properties of a compound. Deviating from this conventional approach, in this investigation we employed the classification Read-Across Structure-Activity Relationship (c-RASAR) framework, which combines the concepts of classification-based quantitative structure-activity relationship (QSAR) modeling and Read-Across to incorporate Read-Across-derived similarity and error-based descriptors into a statistical and machine learning modeling framework. ML models developed from these RASAR descriptors use similarity-based information from the close source neighbors of a particular query compound. We applied different classification modeling algorithms to the selected QSAR and RASAR descriptors to develop models for predicting the hepatotoxicity of query compounds. The predictivity of each model was evaluated on a large number of test set compounds, and the best-performing model was additionally used to screen a true external data set. The concepts of explainable AI (XAI) coupled with Read-Across were used to interpret the contributions of the RASAR descriptors in the best c-RASAR model and to explain the chemical diversity in the dataset. The application of unsupervised dimensionality reduction techniques such as t-SNE and UMAP, together with the supervised ARKA framework, showed the advantage of the RASAR descriptors over the selected QSAR descriptors in grouping similar compounds, enhancing the modelability of the dataset, and efficiently identifying activity cliffs. Activity cliffs were also identified from Read-Across by examining the nature of the compounds constituting the nearest neighbors of a particular query compound. A comparison of our simple linear c-RASAR model with previously reported models developed using the same dataset, derived from the US FDA Orange Book (https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm), showed that our model is simple, reproducible, transferable, and highly predictive. The performance of the linear discriminant analysis (LDA) c-RASAR model on the true external set surpasses that of the previously reported work. Therefore, the present simple LDA c-RASAR model can be used efficiently to predict the hepatotoxicity of query chemicals.
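
To make concrete the idea of deriving descriptors from the close source neighbors of a query compound, the following is a minimal sketch and not the authors' implementation. It assumes that compounds are represented as sets of "on" fingerprint bits, that similarity is measured with the Tanimoto coefficient, and that three neighbors are used. The function names, the descriptor names (`avg_sim`, `weighted_activity`, `max_pos`, `max_neg`), and the toy data are all hypothetical and do not reproduce the 18 RASAR descriptors defined in Table S1.

```python
from statistics import mean


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_descriptors(query, sources, k=3):
    """Illustrative similarity-based descriptors for one query compound.

    sources: list of (fingerprint_bit_set, label) pairs, where label is
    1 (toxic) or 0 (non-toxic). Descriptor names are hypothetical.
    """
    # Rank all source compounds by similarity to the query, keep the k closest.
    scored = sorted(
        ((tanimoto(query, fp), label) for fp, label in sources),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbors = scored[:k]

    sims = [s for s, _ in neighbors]
    total = sum(sims)
    pos = [s for s, y in neighbors if y == 1]
    neg = [s for s, y in neighbors if y == 0]
    return {
        # Mean similarity of the query to its close source neighbors.
        "avg_sim": mean(sims),
        # Similarity-weighted mean of the neighbors' class labels.
        "weighted_activity": sum(s * y for s, y in neighbors) / total if total else 0.0,
        # Highest similarity to a positive / negative close neighbor.
        "max_pos": max(pos, default=0.0),
        "max_neg": max(neg, default=0.0),
    }


# Toy source set (assumed data, for illustration only).
sources = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({6, 7, 8}, 0),
    ({1, 6, 7, 9}, 0),
]
query = {1, 2, 3}
desc = similarity_descriptors(query, sources, k=3)
print(desc)
```

In a workflow of this kind, one such descriptor vector would be computed for every training and test compound (using only training compounds as sources) and then passed to a classifier such as LDA. A query whose closest neighbors mostly carry the opposite class label, that is, one with a high `max_neg` despite being positive or vice versa, would be a candidate activity cliff.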

Keywords

Hepatotoxicity
c-RASAR
ARKA
Banerjee-Roy coefficient
Activity cliffs
Dimensionality reduction

Supplementary materials

Supplementary Materials SI-1 and SI-2
Supplementary Material-1 (SI-1) contains the original data set used for modeling, with the values of the structural and physicochemical descriptors and the selected RASAR descriptors. Supplementary Material-2 (SI-2) contains the following tables and figures:
Table S1. List of the 18 different RASAR descriptors and their significance
Table S2. List of QSAR descriptors used for modeling
Table S3. Optimized hyperparameter settings for the ML-based QSAR and c-RASAR models
Table S4. Comparison of test set prediction performance of c-RASAR vs. QSAR models
Table S5. LDA c-RASAR model coefficients
Figure S1. Most discriminating QSAR descriptors
Figure S2. Heat maps showing the variation in the selected RASAR descriptors for the first and last 20 compounds of the training and test sets
Figure S3. Chord diagram representing the contribution of sm1 descriptors towards positive and negative activity values
