Machine learning can make a strong contribution to accelerating the discovery of transition metal complexes (TMCs). These compounds will play a key role in the development of urgently needed new technologies, including the production of green hydrogen from renewable sources. Despite recent developments in machine learning for drug discovery and organic chemistry in general, the application of these methods to TMCs remains hindered by their higher complexity and the limited availability of large datasets. In this work, we report a representation for deep graph learning on TMCs, the natural quantum graph (NatQG), which leverages the electronic structure data available from natural bond orbital (NBO) analysis. These data were used to define both the topology of the NatQG graphs and the information they express. At the topology level, two different NatQG flavors were developed: u-NatQG, with undirected edges, and d-NatQG, with edges directed along donor → acceptor orbital interactions. At the information level, the node and edge attribute vectors of both graphs contain NBO data, including natural charges and bond orders. The NatQG graphs were used to develop graph neural networks (GNNs) for the prediction of the quantum properties underlying the structure and reactivity of TMCs (e.g., HOMO-LUMO gap and polarizability). These models surpassed baselines based on traditional descriptors and performed at a level similar to, or higher than, state-of-the-art GNNs based on radial cutoffs. The results showed that the electronic structure information encoded by the models has a stronger impact on their accuracy than the geometric information. With the aim of benchmarking the GNNs, we also developed the transition metal quantum mechanics graph dataset (tmQMg), which provides the geometries, properties, and NatQG graphs of 60k TMCs.
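The difference between the two graph flavors can be illustrated with a minimal sketch. This is not the authors' construction: the `Graph` class, the `build_graph` helper, the three-atom fragment, and all numerical values are hypothetical placeholders chosen only to show how undirected edges (u-NatQG) differ from edges directed donor → acceptor (d-NatQG), with natural charges as node attributes and bond orders as edge attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny attributed graph; hypothetical helper, not part of any library."""
    directed: bool
    node_attrs: dict = field(default_factory=dict)  # node index -> attributes
    edge_attrs: dict = field(default_factory=dict)  # (i, j) -> attributes

    def _key(self, i, j):
        # Undirected edges are stored once, under a sorted index pair.
        return (i, j) if self.directed else tuple(sorted((i, j)))

    def add_edge(self, i, j, **attrs):
        self.edge_attrs[self._key(i, j)] = attrs

    def has_edge(self, i, j):
        return self._key(i, j) in self.edge_attrs


# Hypothetical toy fragment: a metal centre (0) bound to N (1) and Cl (2).
# All numbers below are placeholders, not NBO results.
natural_charges = {0: 0.50, 1: -0.60, 2: -0.40}

# (donor atom, acceptor atom, bond order): ligand atoms donating to the metal.
interactions = [(1, 0, 0.45), (2, 0, 0.60)]


def build_graph(directed):
    """Build the undirected or the donor -> acceptor directed variant."""
    graph = Graph(directed=directed)
    for index, charge in natural_charges.items():
        graph.node_attrs[index] = {"natural_charge": charge}
    for donor, acceptor, order in interactions:
        graph.add_edge(donor, acceptor, bond_order=order)
    return graph


u_graph = build_graph(directed=False)  # analogous to u-NatQG
d_graph = build_graph(directed=True)   # analogous to d-NatQG
```

In this sketch, `d_graph.has_edge(1, 0)` holds while `d_graph.has_edge(0, 1)` does not, whereas `u_graph` reports the edge in both orientations. The directed variant therefore retains which atom acts as the donor, information that the undirected variant discards.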
Further information is provided on the statistics of the tmQMg dataset and its outliers, together with technical details of the GNN models, the baseline representation, and the linear fitting of the atomic energies used to predict energy targets. The error metrics obtained with the training dataset, the Python libraries used to develop the HyDGL code, and the computational details of the tmQMg dataset are also provided.