Abstract
The ever-increasing amount of data generated by computational materials science tools has propelled the invention of new machine learning models, and subsequently assisted in the discovery of new materials. Here we present an overdue questioning of the data itself: is it suitable for training machine learning models? By examining the energy above the convex hull (Eh), electronic bandgap and formation energy data in the Materials Project dataset, we find that Eh is an unsteady quantity, because the materials present in the database do not sufficiently represent the chemical spaces that are necessary to account for crystal decomposition. The unsteadiness of Eh also applies to DFT-computed voltages, because the computed voltage is the average of voltages obtained from the known cation-deficient stable materials. We also show discrepancies in the reported electronic bandgap values in the Materials Project database, and that the formation energy data can shift due to arbitrary changes in the interlayer distances of layered materials, or when optimisation parameters are found that reduce the energy of a structure below the value deposited in the database. We discuss possible approaches to mitigate these data problems.