Pesticides benefit agriculture by increasing crop yield, quality, and security. However, pesticides may inadvertently harm bees, which are valuable as pollinators. Thus, candidate pesticides in development pipelines must be assessed for toxicity to bees. Leveraging a data set of 382 molecules with toxicity labels from honey bee exposure experiments, we train a support vector machine (SVM) to predict the toxicity of pesticides to honey bees. We compare two representations of the pesticide molecules: (i) a random walk feature vector listing the counts of length-L walks on the molecular graph that exhibit each vertex- and edge-label sequence and (ii) the MACCS structural key fingerprint (FP), a bit vector indicating the presence/absence of each pattern in a list of pre-defined subgraph patterns in the molecular graph. We explicitly construct the MACCS FPs, but for the random walk representation we rely on the fixed-length-L random walk graph kernel (RWGK) in place of the dot product. The L-RWGK-SVM achieves an accuracy, precision, recall, and F1 score (mean over 2000 runs) of 0.81, 0.68, 0.71, and 0.69, respectively, on the test data set, with L=4 the mode of the optimal walk length. The MACCS-FP-SVM performs on par with, or marginally better than, the L-RWGK-SVM and is more interpretable, but its performance varies more. We interpret the MACCS-FP-SVM by illuminating which subgraph patterns in the molecules tend to strongly push them towards the toxic/non-toxic side of the separating hyperplane.
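To make the random walk representation concrete, the following is a minimal sketch (not the authors' implementation) that counts the length-L walks on a small labeled graph by label sequence and evaluates the kernel as the dot product of two count vectors. The function names `walk_features` and `walk_kernel`, the graph encoding, and the toy hydrogen-suppressed molecules are assumptions made for illustration. The sketch also assumes that each walk is counted once per direction, and it builds the count vectors explicitly, which the RWGK is meant to avoid in practice.

```python
from collections import Counter


def walk_features(vertex_labels, edges, L):
    """Count length-L walks by their vertex- and edge-label sequence.

    vertex_labels: list of labels; the list index is the vertex id.
    edges: dict mapping an undirected edge (u, v) to its edge label.
    Returns a Counter keyed by (v0_label, e1_label, v1_label, ...).
    """
    # Adjacency list that stores each undirected edge in both directions.
    adj = {v: [] for v in range(len(vertex_labels))}
    for (u, v), label in edges.items():
        adj[u].append((v, label))
        adj[v].append((u, label))

    # walks[v] counts the label sequences of walks that currently end at v.
    walks = {v: Counter({(vertex_labels[v],): 1}) for v in adj}
    for _ in range(L):
        extended = {v: Counter() for v in adj}
        for u, sequences in walks.items():
            for seq, count in sequences.items():
                for v, label in adj[u]:
                    extended[v][seq + (label, vertex_labels[v])] += count
        walks = extended

    # Pool the counts over all end vertices.
    total = Counter()
    for counts in walks.values():
        total.update(counts)
    return total


def walk_kernel(f, g):
    """Dot product of two sparse walk-count vectors."""
    return sum(count * g[seq] for seq, count in f.items())


# Toy graphs (heavy atoms only): C-O and C-C-O, all single bonds.
methanol = walk_features(["C", "O"], {(0, 1): "-"}, L=1)
ethanol = walk_features(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, L=1)
print(walk_kernel(methanol, ethanol))  # 2
```

The two graphs share the label sequences C-O and O-C, each occurring once in both, so the kernel value at L=1 is 2.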
We explain and use the classical MACCS structural key fingerprint as a baseline representation for the pesticide molecules. We compare and contrast the random walk feature vector with the MACCS fingerprint, both intuitively and empirically. We interpret an SVM based on the MACCS fingerprints by illuminating which molecular subgraph patterns tend to push pesticides towards the toxic/non-toxic side of the separating hyperplane of the SVM. To adopt a more practical machine learning procedure, we now treat the random walk length L as a hyperparameter to be tuned during each train-test run, as opposed to our previous setup, where we specified L a priori. We add two new illustrations to clarify the construction and meaning of the random walk feature vector.
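The hyperplane interpretation can be summarized with the standard linear SVM decision function. This is a sketch under the assumption that the MACCS-FP-SVM uses a plain dot product on the bit vectors; the symbols below, and the convention that positive values correspond to the toxic class, are introduced here for illustration and do not come from the text.

```latex
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = \sum_{i} w_i x_i + b,
\qquad x_i \in \{0, 1\},
\qquad \mathbf{w} = \sum_{n} \alpha_n y_n \mathbf{x}_n
```

Here $x_i$ indicates whether subgraph pattern $i$ is present in the molecule, $y_n \in \{-1, +1\}$ is the label of training molecule $n$, and $\alpha_n \geq 0$ is its dual coefficient. Under these assumptions, the presence of pattern $i$ shifts $f$ by exactly $w_i$, so patterns with large positive or large negative weights are the ones that push a molecule most strongly towards one side of the hyperplane.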