ARKA: A framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data

29 March 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Toxicity assessment of environmental chemicals is integral to assessing the sustainability of the flora and fauna that constitute aquatic and terrestrial ecosystems. A wide variety of living organisms are constantly exposed to these chemicals, most of which generate toxic effects. Because experimental toxicity data for environmental chemicals are scarce, data gaps need to be filled by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for efficient prediction of the toxicity of query compounds. However, the predictions from these models are often erroneous for some compounds, and the reliability of predictions from QSARs derived from small datasets is often questionable from a statistical point of view. This is because the number of descriptors is large relative to the number of training compounds, which reduces the degrees of freedom of the developed model. To reduce the overall prediction error of a given QSAR model, we propose here the computation of the novel Arithmetic Residuals in K-groups Analysis (ARKA) descriptors. This approach reduces the number of modeling descriptors while retaining the entire chemical space and preventing the loss of chemical information. We applied the ARKA framework to classification modeling of five representative environmentally relevant endpoints with graded responses (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals).
On comparing the performance of the models generated using conventional QSAR descriptors with that of the models generated using ARKA descriptors, the prediction quality of the ARKA-based models was found to be much better than that of the models derived from QSAR descriptors, signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small datasets. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from input QSAR descriptors.
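
The degrees-of-freedom argument in the abstract can be illustrated with a minimal sketch. It assumes an ordinary linear model with an intercept, for which the residual degrees of freedom are n - p - 1 (n training compounds, p descriptors); the abstract does not state the model form, and the dataset sizes below are hypothetical, not taken from the paper.

```python
def residual_degrees_of_freedom(n_compounds: int, n_descriptors: int) -> int:
    """Residual degrees of freedom of a linear model with an intercept.

    Assumption for illustration only: n - p - 1, the standard count for
    ordinary least squares with one intercept term.
    """
    return n_compounds - n_descriptors - 1


# Hypothetical sizes: a small training set and shrinking descriptor counts.
n_train = 40
for p in (30, 10, 5):
    df = residual_degrees_of_freedom(n_train, p)
    print(f"{p:>2} descriptors, {n_train} compounds: {df} residual degrees of freedom")
```

With a fixed number of training compounds, every descriptor removed returns one degree of freedom to the error estimate, which is the statistical motivation the abstract gives for reducing the descriptor count.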

Keywords

ARKA descriptors
Machine learning
Activity cliffs
Modelability
K-groups analysis

Supplementary materials

Title: Supplementary materials SI-1, SI-2, SI-3
Description: Raw data for analysis, Computation of ARKA descriptors, Analysis of models
