A New Approach Methodology (NAM) for carcinogenicity prediction of organic chemicals using the Multiclass ARKA framework and machine-learning-based stacking regression

Arkaprava Banerjee; Kunal Roy

doi:10.26434/chemrxiv-2025-6lxdn

Theoretical and Computational Chemistry

28 May 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The continuous accumulation of agricultural and industrial chemicals in the environment has significantly impacted flora and fauna, disrupting food chains and disturbing the biological ecosystem. A large fraction of such chemicals poses a wide array of health risks to humans by affecting various adverse outcome pathways (AOPs). Carcinogenicity, which affects millions worldwide, is one of the most alarming adverse effects of these chemicals. To identify and safely dispose of such agricultural and industrial chemicals efficiently, it is necessary to determine toxicity quickly and easily and to fill toxicity data gaps. In this study, we developed multiple Machine Learning (ML) regression-based models that quantitatively predict the Oral Slope Factor (OSF) and Inhalation Slope Factor (ISF) of chemicals, thereby identifying and prioritizing carcinogenicity risks. First, we developed Partial Least Squares (PLS) Quantitative Structure-Activity Relationship (QSAR) models. To further enhance the robustness and external predictivity of the models and to make better use of the available chemical space, we developed similarity-driven quantitative Read-Across Structure-Activity Relationship (q-RASAR) models. We then explored the Arithmetic Residuals in K-Groups Analysis (ARKA) framework to develop Hybrid ARKA and ARKA-RASAR models, which account for the response range-specific contribution of descriptors.

We used the simple and reproducible PLS modeling algorithm to develop the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models for both responses, and further applied a wide array of ML modeling algorithms as stacking regressors: Linear Support Vector Regression (LSVR), Ridge Regression (RR), k-Nearest Neighbor Regression (k-NN), Multilayer Perceptron Regression (MLP), Random Forests Regression (RF), Extra Trees Regression (ET), Gradient Boosting Regression (GB), PLS, and Multiple Linear Regression (MLR). The best-performing models were selected using a multi-criteria decision-making approach, the Sum of Ranking Differences (SRD), considering training, test, and cross-validation statistics. Additionally, we predicted the OSF and ISF of a true external data set and showed that the quantitative results align well with the reported carcinogenic status. With its enhanced robustness and external predictivity, the ARKA-RASAR approach has been shown to be a useful tool in ecotoxicological risk assessments.
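As a rough illustration of the stacking regression idea mentioned in the abstract, the sketch below combines a few base regressors with a linear meta-learner using scikit-learn's `StackingRegressor`. It is not the authors' pipeline: the abstract does not state how the base models and the final estimator are wired together, so the choice of base learners (Ridge, k-NN, Random Forest), the MLR final estimator, the hyperparameters, and the synthetic descriptor matrix `X` and response `y` are all assumptions made for demonstration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a descriptor matrix and a continuous response
# (hypothetical data; NOT the OSF/ISF data sets from the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base learners produce out-of-fold predictions (cv=5); the final
# estimator (here MLR) is trained on those predictions.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = stack.predict(X_test)
r2_test = r2_score(y_test, y_pred)
print(f"Test-set R2: {r2_test:.3f}")
```

Swapping `final_estimator` for another regressor (for example a gradient boosting or support vector model) gives a different stacked model from the same base predictions, which is one plausible way to compare several meta-learners.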

Keywords

Sum of Ranking Differences

Supplementary materials

Supplementary Materials SI-1, SI-2, SI-3

Supplementary Material SI-1 contains the OSF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results. Supplementary Material SI-2 contains the ISF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results. Supplementary Material SI-3 contains Supplementary Figures comparing the metrics of the stacking regressors.

Supplementary weblinks

Multi-Class ARKA

The Multiclass-ARKA tools can be downloaded from this link

RASAR Descriptor Calculator

The RASAR descriptor calculator tool can be downloaded from this link

License

The content is available under CC BY-NC-ND 4.0

Funding

Defence Research and Development Organisation

LSRB/01/15001/M/LSRB-394/SH&DD/2022

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
