A New Approach Methodology (NAM) for carcinogenicity prediction of organic chemicals using the Multiclass ARKA framework and machine-learning-based stacking regression

28 May 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The continuous accumulation of agricultural and industrial chemicals in the environment has significantly affected flora and fauna, disrupting food chains and disturbing ecosystems. A large fraction of these chemicals poses a wide array of health risks to humans by acting through various adverse outcome pathways (AOPs). Carcinogenicity is among the most alarming of these adverse effects and affects millions of people worldwide. To identify such chemicals efficiently and dispose of them safely, toxicity must be determined quickly and easily, and toxicity data gaps must be filled. In this study, we developed multiple Machine Learning (ML) regression-based models that quantitatively predict the Oral Slope Factor (OSF) and Inhalation Slope Factor (ISF) of chemicals, so that carcinogenicity risks can be identified and prioritized. We first developed Partial Least Squares (PLS) Quantitative Structure-Activity Relationship (QSAR) models. To further enhance the robustness and external predictivity of the models and to make better use of the available chemical space, we then developed similarity-driven quantitative Read-Across Structure-Activity Relationship (q-RASAR) models. Finally, we explored the Arithmetic Residuals in K-Groups Analysis (ARKA) framework to develop Hybrid ARKA and ARKA-RASAR models that account for the response-range-specific contributions of descriptors.
We used the simple and reproducible PLS modeling algorithm to develop the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models for both responses. We then applied a wide array of ML modeling algorithms as stacking regressors, such as Linear Support Vector Regression (LSVR), Ridge Regression (RR), k-Nearest Neighbor Regression (k-NN), Multilayer Perceptron Regression (MLP), Random Forest Regression (RF), Extra Trees Regression (ET), Gradient Boosting Regression (GB), PLS, and Multiple Linear Regression (MLR). The best-performing models were selected using a multi-criteria decision-making approach, the Sum of Ranking Differences (SRD), which considered training, test, and cross-validation statistics. Additionally, we predicted the OSF and ISF of a true external data set and showed that the quantitative results align well with the reported carcinogenic status. With its enhanced robustness and external predictivity, the ARKA-RASAR approach is shown to be a useful tool in ecotoxicological risk assessment.
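
The SRD-based model selection mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one common arrangement in which each model is a column of validation metrics (all oriented so that higher is better and on a comparable scale), the reference is the row-wise mean, and a smaller SRD means a model's ranking of the metrics is closer to the consensus. The function names, the model labels, and the toy numbers are invented for illustration and do not come from the study. The randomization-based validation that usually accompanies SRD is omitted.

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srd_scores(table, reference=None):
    """Sum of absolute rank differences between each model and a reference.

    table: dict mapping a model name to a list of metric values; every list
           uses the same row order (for example train, cross-validation, test).
    reference: optional list of reference values; defaults to the row-wise mean.
    """
    models = list(table)
    n_rows = len(table[models[0]])
    if reference is None:
        reference = [mean(table[m][r] for m in models) for r in range(n_rows)]
    ref_ranks = average_ranks(reference)
    return {
        m: sum(abs(a - b) for a, b in zip(average_ranks(table[m]), ref_ranks))
        for m in models
    }


# Hypothetical metric values (NOT results from the preprint).
# Rows: training statistic, cross-validation statistic, test statistic.
toy_metrics = {
    "model_A": [0.80, 0.75, 0.70],
    "model_B": [0.70, 0.75, 0.80],
    "model_C": [0.78, 0.79, 0.72],
}

scores = srd_scores(toy_metrics)
for name in sorted(scores, key=scores.get):
    print(name, scores[name])
```

In this toy case the model whose metric ordering matches the consensus ordering receives the smallest SRD and is listed first. In practice the choice of reference (mean, row-wise maximum, or a known gold standard) and the scaling of the metrics both affect the outcome and should be stated explicitly.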

Keywords

Carcinogenicity
QSAR
q-RASAR
Machine learning
Hybrid ARKA
ARKA-RASAR
Sum of Ranking Differences

Supplementary materials

Title: Supplementary Materials SI-1, SI-2, SI-3
Description: Supplementary Material SI-1 contains the OSF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results. Supplementary Material SI-2 contains the ISF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results. Supplementary Material SI-3 contains Supplementary Figures comparing the metrics of the stacking regressors.
