Abstract
The European Chemicals Agency (ECHA) and the US Environmental Protection Agency (EPA) have listed approximately 800k chemicals that must be further investigated for their potential environmental and/or human health risk. A significant number of these chemicals (e.g. industrial chemicals and agrochemicals) are consumed globally in volumes large enough to reach the limits of detection of our analytical chemistry methods in environmental samples, yet experimental data on their environmental fate and toxicity are largely missing. Filling these data gaps experimentally for such a large number of chemicals is practically impossible, which makes modelling approaches for predicting chemical property data highly relevant. However, the currently available models suffer from limited training sets and from linearity and continuity assumptions. In this study we present a supervised direct classification model that connects the molecular descriptors of chemicals directly to their toxicity. As a proof of concept we used 907 experimentally determined 96 h LC50 values for acute fish toxicity. Classification was performed, via machine learning, into two types of toxicity categories: 1) categories derived via k-means clustering from the experimental dataset and 2) hazard categories defined by the Globally Harmonized System of Classification and Labelling of Chemicals (GHS). Our direct classification model explained ≈ 90% of the variance in our data for the training set and ≈ 80% for the test set. Direct comparison of our classification model with the conventional strategy (i.e. a QSAR regression model) showed a five-fold decrease in incorrect chemical categorizations for our model. The optimized model was employed to predict the toxicity categories of ≈ 32k chemicals (from the Norman SusDat). Finally, the model-based applicability domain (AD) was compared with the training-set-based AD, suggesting that the training-set-based AD is a more adequate way to avoid extrapolation when using such models.
The better performance of our direct classification model compared with conventionally employed QSAR methods makes this approach a viable tool for hazard identification and risk assessment of chemicals.
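The two kinds of toxicity categories named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The GHS thresholds used here are the standard acute aquatic hazard cut-offs for a 96 h fish LC50 (≤ 1, ≤ 10 and ≤ 100 mg/L). Everything else is an assumption made for illustration: the function names `ghs_category` and `kmeans_1d`, clustering on log10-transformed LC50 values, the number of clusters, the initialisation scheme, and the example concentrations.

```python
import math
from typing import List

# GHS acute aquatic hazard cut-offs for a 96 h fish LC50, in mg/L.
GHS_BOUNDS = [(1.0, "Acute 1"), (10.0, "Acute 2"), (100.0, "Acute 3")]


def ghs_category(lc50_mg_per_l: float) -> str:
    """Map a 96 h LC50 (mg/L) onto a GHS acute aquatic hazard category."""
    if lc50_mg_per_l <= 0:
        raise ValueError("LC50 must be positive")
    for upper, label in GHS_BOUNDS:
        if lc50_mg_per_l <= upper:
            return label
    return "Not classified"


def kmeans_1d(values: List[float], k: int, iters: int = 100) -> List[int]:
    """Plain Lloyd's k-means in one dimension; returns a cluster index per value."""
    ordered = sorted(values)
    # Assumption: deterministic start, centres spread across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels: List[int] = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


# Hypothetical LC50 values in mg/L, not taken from the study's dataset.
lc50 = [0.05, 0.2, 0.8, 5.0, 8.0, 20.0, 60.0, 500.0]
ghs_labels = [ghs_category(x) for x in lc50]
# Assumption: cluster on log10(LC50) because the values span orders of magnitude.
cluster_labels = kmeans_1d([math.log10(x) for x in lc50], k=2)
```

Either label vector could then serve as the target of a supervised classifier trained on molecular descriptors, which is the "direct classification" idea the abstract describes.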
Supplementary materials
SI: Supporting Information