Extracting Knowledge from DFT: Experimental Band Gap Predictions Through Ensemble Learning

The field of materials science has seen an explosion in the amount of accessible, high-quality data. With this surge, the application of machine learning (ML) to materials data has produced strong results. Particular success has been found in training models based on chemical formula. Such models have traditionally learned from either density functional theory (DFT) data or experimental data. Although some researchers have explored using DFT-calculated properties as features for learning, this approach has not gained much traction because the machine learning predictions would be limited by DFT computation time and accuracy. In this work, we explore a stacked ensemble learning system that incorporates machine learning on DFT calculations to improve learning on experimental data. This is accomplished by handling the DFT and experimental data separately and training distinct models for each. The DFT models are used to generate a "predicted DFT" value for the formulae in the experimental data. A meta-learner, trained on predictions from the experimental models combined with predictions from the DFT models, is shown to improve root-mean-squared error by over 9% on the test data when compared to a baseline model that learns only from the training data.
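The stacking workflow described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration rather than the method or data used in this work: the single numeric descriptor `x` stands in for formula-derived features, the synthetic "DFT" and "experimental" band gaps are invented, the base models are simple polynomial least-squares fits, and the names `solve_least_squares` and `run_toy_stack` are hypothetical. For brevity the meta-learner is fit on in-sample base-model predictions; a careful implementation would more likely use out-of-fold predictions.

```python
import math
import random


def solve_least_squares(rows, targets):
    """Return weights minimizing squared error, via the normal equations."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * t for r, t in zip(rows, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, row):
    return sum(w * v for w, v in zip(weights, row))


def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def run_toy_stack(seed=0):
    rng = random.Random(seed)

    def toy_dft_gap(x):
        # Invented relationship between descriptor and "DFT" gap.
        return 0.5 + 0.8 * x + 0.3 * x * x

    # Large synthetic "DFT" set and small synthetic "experimental" set.
    dft_x = [rng.uniform(0, 4) for _ in range(400)]
    dft_y = [toy_dft_gap(x) + rng.gauss(0, 0.05) for x in dft_x]
    exp_x = [rng.uniform(0, 4) for _ in range(60)]
    # Assumption: experimental gaps are a scaled, shifted, noisier DFT gap.
    exp_y = [1.3 * toy_dft_gap(x) + 0.4 + rng.gauss(0, 0.2) for x in exp_x]
    train_x, test_x = exp_x[:40], exp_x[40:]
    train_y, test_y = exp_y[:40], exp_y[40:]

    def dft_feats(x):
        return [1.0, x, x * x]

    def exp_feats(x):
        return [1.0, x]

    # Distinct base models: one trained on DFT data, one on experimental data.
    w_dft = solve_least_squares([dft_feats(x) for x in dft_x], dft_y)
    w_exp = solve_least_squares([exp_feats(x) for x in train_x], train_y)

    def baseline(x):
        return predict(w_exp, exp_feats(x))

    def meta_feats(x):
        # Intercept, experimental-model prediction, "predicted DFT" value.
        return [1.0, baseline(x), predict(w_dft, dft_feats(x))]

    # Meta-learner: linear combination of the base-model predictions.
    w_meta = solve_least_squares([meta_feats(x) for x in train_x], train_y)

    def stacked(x):
        return predict(w_meta, meta_feats(x))

    return {
        "baseline_train_rmse": rmse([baseline(x) for x in train_x], train_y),
        "stacked_train_rmse": rmse([stacked(x) for x in train_x], train_y),
        "baseline_test_rmse": rmse([baseline(x) for x in test_x], test_y),
        "stacked_test_rmse": rmse([stacked(x) for x in test_x], test_y),
        "meta_weights": w_meta,
    }


if __name__ == "__main__":
    for name, value in run_toy_stack().items():
        print(name, value)
```

In this toy setup the "predicted DFT" value carries curvature that the linear experimental model cannot represent, which is the kind of complementary information a meta-learner can exploit. Any error reduction printed by the sketch reflects only the synthetic data and says nothing about the 9% figure reported here.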