A Combined DFT/Machine Learning Framework for Materials Discovery: Application to Spinels and Assessment of Search Completeness and Efficiency

12 October 2020, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


It is challenging to evaluate machine learning approaches developed for accelerating materials search and discovery in a realistic way. Machine learning approaches to materials stability prediction are typically assessed by their ability to reproduce results from direct physical modeling, whereas ideally both machine learning and direct physical modeling should be assessed by their ability to reproduce reality. Additionally, traditional evaluation metrics do not directly reflect the experience of an experimental search for unknown compounds in a large candidate phase space, and often result in overly optimistic assessments. Here, we (i) present a framework that combines density functional theory and traditional supervised machine learning methods (ML/DFT), and (ii) introduce the concepts of search completeness – the fraction of discoverable compounds found relative to the fraction of search space explored – and search efficiency – the rate of discovery relative to the fraction of search space explored – to evaluate it. The ML/DFT framework is an iterative approach to predict stable chemistries of a fixed crystal structure (here, spinels) that uses DFT to generate a training set of unstable compounds. The training set of stable compounds is given by experimentally known spinels. The method is carried out using random forest, LASSO, and ridge regression to predict as-of-yet undiscovered spinel chemistries. TreeSHAP analysis is used to determine features that most contribute to stability/instability classification. While no single feature dominates, several emerge that align with chemical intuition. To estimate the efficacy of ML/DFT compared to pure DFT, we introduce a Bayesian description of DFT distribution of energies for stable and unstable spinels. The Bayesian model enables quantifying the search completeness and search efficiency of DFT, which is then compared to that of ML/DFT. ML/DFT achieves search completeness and efficiency on par with pure DFT, despite requiring fewer DFT simulations (∼300 vs. 14,200). More importantly, by quantitatively assessing ML approaches in ways that better reflect how they would be used in materials discovery experiments, we obtain key insights into the challenges that need to be overcome by such methods: that the small number of stable compounds to be found in a search space orders of magnitude larger places stringent demands on model accuracy to achieve good search efficiency. Finally, we report the top candidates of our spinel search, which may be of interest for synthesis experiments


machine learning
phase stability


Comments are not moderated before they are posted, but they can be removed by the site moderators if they are found to be in contravention of our Commenting Policy [opens in a new tab] - please read this policy before you post. Comments should be used for scholarly discussion of the content in question. You can find more information about how to use the commenting feature here [opens in a new tab] .
This site is protected by reCAPTCHA and the Google Privacy Policy [opens in a new tab] and Terms of Service [opens in a new tab] apply.