Prediction-inspired intelligent training for the development of c-RASAR models for organic skin sensitizers: Assessment of classification error rate from novel similarity coefficients

22 May 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Advances in cheminformatics have reduced the need for animal testing to estimate the activity, property, or toxicity of query chemicals. Read-Across Structure-Activity Relationship (RASAR) is an emerging concept that uses various similarity functions derived from chemical information to develop highly predictive models. Unlike in quantitative structure-activity relationship (QSAR) models, the RASAR descriptors of a query compound are computed from its close congeners rather than from the compound itself, so that prediction is already targeted in the model training phase. The objective of the present study is not to propose new QSAR models for skin sensitization, but to demonstrate that classification-based RASAR (c-RASAR) models improve the quality of predictions of the skin-sensitizing potential of organic compounds. A diverse, previously curated dataset was collected from the literature, and 2D descriptors were computed for it. The essential features extracted from these descriptors were then used to develop a classification-based linear discriminant analysis (LDA) QSAR model. Next, RASAR descriptors were calculated from the Read-Across-based predictions, using the basic hyperparameter settings for the Laplacian kernel-based optimum similarity measure. After feature selection, an LDA c-RASAR model was developed that surpassed the prediction quality of the LDA QSAR model. Several other combinations of RASAR descriptors were also used to develop additional c-RASAR models, all of which showed better prediction quality than the LDA QSAR model while using fewer descriptors. Other machine learning c-RASAR models were also developed for comparison. In this work, we propose and analyze three new similarity metrics: gm_class, sm1, and sm2. The first is an indicator variable used to generate a simple univariate c-RASAR model with good prediction ability, while the other two are similarity indices used to analyze possible activity cliffs in the training and test sets and are believed to play an important role in the modelability analysis of datasets.
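To make the idea of similarity-derived descriptors concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard Laplacian kernel, exp(-gamma * L1 distance), and computes a few illustrative descriptors for a query compound from its k most similar training compounds. The function names, the descriptor names (`max_sim`, `mean_sim`, `weighted_pos_fraction`), and the defaults for `k` and `gamma` are all hypothetical; they are not the paper's gm_class, sm1, or sm2, whose definitions are not given in the abstract.

```python
import math


def laplacian_similarity(x, y, gamma=1.0):
    """Standard Laplacian kernel: exp(-gamma * L1 distance between x and y)."""
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1)


def read_across_descriptors(query, train_X, train_y, k=3, gamma=1.0):
    """Illustrative similarity-based descriptors for one query compound.

    Assumption: descriptors are derived from the k most similar training
    compounds (the "close congeners"), with class labels 1 (sensitizer)
    and 0 (non-sensitizer). The descriptor set is hypothetical.
    """
    # Rank training compounds by similarity to the query, keep the top k.
    neighbours = sorted(
        ((laplacian_similarity(query, x, gamma), y)
         for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    total = sum(s for s, _ in neighbours)
    positive = sum(s for s, y in neighbours if y == 1)

    return {
        # Similarity to the single closest training compound.
        "max_sim": neighbours[0][0],
        # Average similarity over the selected neighbours.
        "mean_sim": total / len(neighbours),
        # Similarity-weighted share of positive neighbours.
        "weighted_pos_fraction": positive / total if total > 0 else 0.0,
    }


# Toy data: three training compounds described by two numeric features.
train_X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_y = [1, 1, 0]
descriptors = read_across_descriptors([0.0, 0.0], train_X, train_y, k=2)
print(descriptors)
```

Descriptors of this kind, computed for every compound, could then be passed to a classifier such as LDA in place of the original structural descriptors, which is the general workflow the abstract describes.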

Keywords

c-RASAR
QSAR
modelability
skin sensitization
Banerjee-Roy coefficient
classification error

Supplementary materials

Title: Supplementary Materials SI-1 and SI-2
Description: SI-1 contains the raw data used for the modeling analysis in Excel format. SI-2 is a Word file with Supplementary Tables.
