The European and US chemical agencies have listed approximately 800k chemicals for which knowledge of potential risks to human health and the environment is lacking. Filling these data gaps experimentally is impossible, so in-silico prediction approaches are essential. Many existing models are, however, limited by assumptions (e.g. linearity and continuity) and by small training sets. In this study we present a supervised direct classification model that connects molecular descriptors to toxicity categories. The categories can be either data-driven (derived by k-means clustering) or regulatory-defined. The model was tested on 907 experimentally determined 96 h LC50 values for acute fish toxicity. It explained ~90% of the variance in the training set and ~80% in the test set. Compared to a QSAR regression model, this strategy gave a 5-fold decrease in incorrect categorization. The model was subsequently employed to predict the toxicity categories of ~32k chemicals. A comparison between the model-based applicability domain (AD) and the training-set-based AD suggested that the training-set-based AD is a more adequate way to avoid extrapolation when using such models. The better performance of our direct classification model compared to QSAR methods makes this approach a viable tool for hazard and risk assessment of chemicals.
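As an illustration of the data-driven categorization step mentioned above, the following is a minimal sketch of one-dimensional k-means applied to toxicity values. It is not the authors' implementation: the function name `kmeans_1d`, the choice of three categories, the log10 scale, and the synthetic values are all assumptions made for the example. In a direct classification setting, the resulting labels would then serve as targets for a supervised classifier trained on molecular descriptors.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm in one dimension (hypothetical helper).

    Returns (labels, centers) with centers sorted in ascending order,
    so label 0 is the category with the lowest values.
    """
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation: centers at evenly spaced quantiles.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort centers and relabel so that categories are ordered by value.
    centers = np.sort(centers)
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return labels, centers

# Synthetic, made-up log10(LC50) values: NOT data from the study.
rng = np.random.default_rng(0)
log_lc50 = np.concatenate([
    rng.normal(-1.0, 0.2, 50),  # assumed "high toxicity" group
    rng.normal(1.0, 0.2, 50),   # assumed "moderate toxicity" group
    rng.normal(3.0, 0.2, 50),   # assumed "low toxicity" group
])

labels, centers = kmeans_1d(log_lc50, k=3)
print("category centers (log10 LC50):", np.round(centers, 2))
print("chemicals per category:", np.bincount(labels, minlength=3))
```

Clustering on a log scale is a design choice of this sketch, motivated by the wide range that LC50 values typically span; regulatory-defined categories would instead use fixed thresholds in place of the learned centers.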
This is the latest version of the manuscript after revision.