Abstract
The continuous accumulation of agricultural and industrial chemicals in the environment has significantly impacted flora and fauna, disrupting food chains and disturbing ecosystems. A large fraction of these chemicals poses a wide array of health risks to humans by affecting various adverse outcome pathways (AOPs). Carcinogenicity is one of the most alarming adverse effects of these chemicals and affects millions of people worldwide. To identify such agricultural and industrial chemicals efficiently and dispose of them safely, it is necessary to determine their toxicity quickly and easily and to fill toxicity data gaps. In this study, we have developed multiple Machine Learning (ML) regression-based models that quantitatively predict the Oral Slope Factor (OSF) and Inhalation Slope Factor (ISF) of chemicals, thereby helping to identify and prioritize carcinogenicity risks. First, we developed Partial Least Squares (PLS) Quantitative Structure-Activity Relationship (QSAR) models. To further enhance the robustness and external predictivity of the models and to make better use of the available chemical space, we then developed similarity-driven quantitative Read-Across Structure-Activity Relationship (q-RASAR) models. Finally, we explored the Arithmetic Residuals in K-Groups Analysis (ARKA) framework to develop Hybrid ARKA and ARKA-RASAR models that account for the response range-specific contribution of descriptors.
We used the simple and reproducible Partial Least Squares (PLS) modeling algorithm to develop QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models for both responses, and further applied a wide array of ML modeling algorithms, namely Linear Support Vector Regression (LSVR), Ridge Regression (RR), k-Nearest Neighbor Regression (k-NN), Multilayer Perceptron Regression (MLP), Random Forests Regression (RF), Extra Trees Regression (ET), Gradient Boosting Regression (GB), PLS, and Multiple Linear Regression (MLR), as stacking regressors. The best-performing models were selected using a multi-criteria decision-making approach, the Sum of Ranking Differences (SRD), which considered training, test, and cross-validation statistics. Additionally, we predicted the OSF and ISF of a true external data set and showed that the quantitative results align well with the reported carcinogenic status. With its enhanced robustness and external predictivity, the ARKA-RASAR approach has been shown to be a useful tool in ecotoxicological risk assessments.
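As a rough illustration of how SRD-based model selection can work, the sketch below ranks each model's validation metrics and sums the absolute differences from the ranks of a reference column. It is a minimal sketch under stated assumptions, not the workflow used in this study: the function names `average_ranks` and `srd`, the choice of the row-wise maximum as the reference, and the toy metric values are all hypothetical, and the validation (randomization) step that usually accompanies SRD is omitted.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srd(scores):
    """Compute the Sum of Ranking Differences for each model.

    scores maps a model name to a list of metric values, in the same metric
    order for every model, where higher is assumed to be better.
    """
    models = list(scores)
    n_metrics = len(scores[models[0]])
    # Assumed reference: the best (maximum) value of each metric across models.
    reference = [max(scores[m][r] for m in models) for r in range(n_metrics)]
    ref_ranks = average_ranks(reference)
    return {
        m: sum(abs(a - b) for a, b in zip(average_ranks(scores[m]), ref_ranks))
        for m in models
    }


# Toy values for illustration only (not results from this study):
# one training, one cross-validation, and one test statistic per model.
toy_scores = {
    "model_A": [0.80, 0.70, 0.75],
    "model_B": [0.85, 0.72, 0.60],
    "model_C": [0.70, 0.75, 0.78],
}
srd_values = srd(toy_scores)
best_model = min(srd_values, key=srd_values.get)
```

The model with the smallest SRD value is the one whose ranking pattern lies closest to the reference. Because SRD compares ranks rather than magnitudes, metrics on different scales are normally brought to a common scale before ranking.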
Supplementary materials
Title
Supplementary Materials SI-1, SI-2, SI-3
Description
Supplementary Material SI-1 contains the OSF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results.
Supplementary Material SI-2 contains the ISF data, the training and test sets of the QSAR, q-RASAR, Hybrid ARKA, and ARKA-RASAR models, and the true external prediction results.
Supplementary Material SI-3 contains Supplementary Figures showing a comparison of metrics of Stacking Regressors.