From descriptors to intrinsic fish toxicity of chemicals: an alternative approach to chemical prioritization

10 October 2022, Version 3
This content is a preprint and has not undergone peer review at the time of posting.


The European Chemicals Agency (ECHA) and US Environmental Protection Agency (EPA) have listed approximately 800k chemicals that must be further investigated for their potential environmental and/or human health risk. A significant number of these chemicals (e.g. industrial chemicals and agrochemicals) are consumed globally in volumes large enough to reach the limits of detection of our analytical chemistry methods in environmental samples, but experimental data on their environmental fate and toxicity are largely missing. Filling these data gaps experimentally for such a large number of chemicals is practically impossible, making modelling approaches to predict chemical property data highly relevant. However, the currently available models suffer from limited training sets and from linearity and continuity assumptions. In this study we present a supervised direct classification model that connects the molecular descriptors of chemicals directly to their toxicity. As a proof of concept we used 907 experimentally determined 96 h LC50 values for acute fish toxicity. Using machine learning, chemicals were classified into two types of toxicity categories: 1) categories derived via k-means clustering from the experimental dataset and 2) hazard categories defined by the Globally Harmonized System of Classification and Labelling of Chemicals (GHS). Our direct classification model explained ≈ 90% of the variance in our data for the training set and ≈ 80% for the test set. Direct comparison of our classification model with the conventional strategy (i.e. a QSAR regression model) showed a five-fold decrease in incorrect chemical categorization for our model. The optimized model was employed to predict the toxicity categories of ≈ 32k chemicals (from the Norman SusDat). Finally, a comparison between the model-based applicability domain (AD) and the training-set-based AD was performed, suggesting that the training-set-based AD is a more adequate way to avoid extrapolation when using such models.
The better performance of our direct classification model compared to conventionally employed QSAR methods makes this approach a viable tool for hazard identification and risk assessment of chemicals.
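The two labelling schemes named in the abstract can be illustrated with a minimal sketch. It covers only how toxicity categories might be assigned to measured LC50 values, not the descriptor-based classifier itself. The LC50 values are made up, and the log10 transform, the choice of three clusters, and the helper names `ghs_category` and `kmeans_1d` are assumptions for illustration, not the authors' settings. The GHS thresholds are the standard acute aquatic hazard limits for 96 h fish LC50 (1, 10 and 100 mg/L).

```python
import math

# GHS acute aquatic hazard thresholds for 96 h fish LC50, in mg/L.
GHS_LIMITS = [(1.0, "Acute 1"), (10.0, "Acute 2"), (100.0, "Acute 3")]


def ghs_category(lc50_mg_per_l):
    """Map a 96 h LC50 (mg/L) to a GHS acute aquatic hazard category."""
    for upper, label in GHS_LIMITS:
        if lc50_mg_per_l <= upper:
            return label
    return "Not classified"


def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's k-means on scalar values; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up LC50 values (mg/L), for illustration only.
lc50 = [0.05, 0.2, 0.8, 3.0, 7.5, 12.0, 40.0, 90.0, 250.0, 800.0]
# Assumption: cluster on log10(LC50), since LC50 spans orders of magnitude.
log_lc50 = [math.log10(x) for x in lc50]

centers, labels = kmeans_1d(log_lc50, k=3)  # k=3 is an arbitrary choice here
ghs = [ghs_category(x) for x in lc50]
for x, lab, cat in zip(lc50, labels, ghs):
    print(f"{x:8.2f} mg/L  cluster={lab}  GHS={cat}")
```

In this sketch the data-driven clusters and the GHS bins need not share boundaries: the clusters follow the density of the measured values, while the GHS limits are fixed by regulation. Either set of labels could then serve as the target of a classifier trained on molecular descriptors.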


Data science
Toxicity category

