Abstract
Machine learning interatomic potentials (MLIPs) have revolutionized the field of atomistic materials simulation, owing both to their remarkable accuracy and to their computational efficiency compared with established \textit{ab initio} methods. Very recently, several general purpose MLIPs have been reported that are broadly applicable across the periodic table. These represent a fascinating opportunity for materials discovery, provided that they are robust and transferable. To stress-test current general purpose MLIPs, we evaluate the performance of M3GNet and MACE models in element-substitution-based structure prediction workflows for a diverse range of inorganic, crystalline materials. Importantly, these results are compared with a full density functional theory based workflow, shifting the focus from merely evaluating single-point energy and force predictions of MLIPs towards an end-to-end perspective. We find that general purpose MLIPs are, on the whole, well suited to accelerating computational materials discovery and structure prediction, but that they also display certain systematic biases. To address these, we introduce a simple metric to quantify MLIP reliability for materials discovery. As a by-product, we also predict novel ground state structures for 15 of the 100 analysed compositions.