Abstract
Graph convolutional neural nets, such as SchNet [Schütt et al., Journal of Chemical Physics, 2018, 148, 241722], provide accurate predictions of chemical quantities without invoking any direct physical or chemical principles. These methods learn a hidden statistical representation of molecular systems in an end-to-end fashion, from xyz coordinates to molecular properties, with many hidden layers in between. This naturally leads to the interpretability question: what underlying chemical model determines the algorithm’s accurate decision-making? To answer this question, we analyze the hidden-layer activations of QM9-trained SchNet, also known as “embedding vectors”, with dimension reduction, linear discriminant analysis, and Euclidean-distance measures. The result is a quantifiable geometry of the model’s decision-making that identifies chemical moieties and has a low parametric space of ∼5 important parameters from the fully trained 128-parameter embedding. The geometry of the embedding space organizes these moieties with sharp linear boundaries that can classify each chemical environment with < 5 × 10⁻⁴ error. The Euclidean distance between embedding vectors serves as a versatile molecular similarity measure, outperforming other popular hand-crafted representations such as Smooth Overlap of Atomic Positions (SOAP). We also show that the embedding vectors can be used to extract observables related to chemical environments, such as pKa and NMR. The work is in line with the recent push for explainable AI and gives insight into the depth of modern statistical representations of chemistry, such as graph convolutional neural nets, in this rapidly evolving technology.
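The two simplest analyses named in the abstract, dimension reduction and Euclidean-distance similarity, can be illustrated with a minimal sketch. It is not the authors' pipeline: it runs on synthetic 128-dimensional vectors rather than real SchNet embeddings, it uses principal component analysis as one possible form of dimension reduction, and every function name and parameter below is hypothetical.

```python
import numpy as np


def make_synthetic_embeddings(n_per_class=200, n_classes=3, dim=128,
                              latent_dim=5, seed=0):
    """Build toy 'embedding vectors' (NOT real SchNet output).

    Each class stands in for one chemical environment. The vectors live
    near a latent_dim-dimensional subspace of the dim-dimensional space.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_classes, latent_dim))
    labels = np.repeat(np.arange(n_classes), n_per_class)
    latent = centers[labels] + rng.normal(scale=0.3,
                                          size=(labels.size, latent_dim))
    mixing = rng.normal(size=(latent_dim, dim))   # embed into 128-D
    noise = rng.normal(scale=0.01, size=(labels.size, dim))
    return latent @ mixing + noise, labels


def explained_variance_ratio(embeddings):
    """Fraction of total variance carried by each principal component."""
    centered = embeddings - embeddings.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def euclidean_distances(embeddings, query):
    """Euclidean distance from one query vector to every embedding."""
    return np.linalg.norm(embeddings - query, axis=1)


embeddings, labels = make_synthetic_embeddings()

# Dimension reduction: how much variance do the leading components carry?
ratios = explained_variance_ratio(embeddings)
top5_share = ratios[:5].sum()

# Similarity: find the closest other vector to atom 0.
distances = euclidean_distances(embeddings, embeddings[0])
distances[0] = np.inf                  # exclude the query itself
nearest = int(np.argmin(distances))
same_environment = bool(labels[nearest] == labels[0])
```

With real data, `embeddings` would be replaced by the per-atom activations extracted from the trained model and `labels` by the functional-group designations. The toy data are constructed to have five latent directions, so the concentration of variance in the first five components is built in here rather than discovered.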
Supplementary weblinks
Title
SchNet Model Embedding Vectors of QM9 Atoms Labelled According to Functional Groups Designation
Description
Embedding vectors for all atoms in the first 10k molecules in the QM9 dataset, generated by a trained SchNet model. Also contains the model from which the embedding vectors were extracted. The model was trained on 100k training points (molecules) and 10k validation points of QM9.