Efficient prediction of structural and electronic properties of hybrid 2D materials using DFT and machine learning

learning Sherif Abdulkader Tawfik, 2, a) Olexandr Isayev, b) Catherine Stampfl, Joe Shapter, 6 David A. Winkler, 8, 9, c) and Michael J. Ford d) School of Mathematical and Physical Sciences, University of Technology Sydney, Ultimo, New South Wales 2007, Australia Institute for Biomedical Materials and Devices (IBMD), Faculty of Science, University of Technology Sydney, Sydney, NSW, Australia Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina 27599, USA School of Physics, The University of Sydney, New South Wales, 2006, Australia Australian Institute for Bioengineering and Nanotechnology, University of Queensland, St. Lucia, Queensland, 4072, Australia School of Chemical and Physical Sciences, Flinders University, Bedford Park, South Australia 5042, Australia Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Bag 10, Clayton Soutth MDC, Victoria 3169, Australia Monash Institute of Pharmaceutical Sciences, Monash University, 381 Royal Parade, Parkville, Victoria 3052, Australia Latrobe Institute for Molecular Science, La Trobe University, Kingsbury Drive, Bundoora, Victoria 3086, Australia


I. INTRODUCTION
An ongoing challenge in material research is the design of materials with desired physical and chemical properties. To overcome this challenge, the application of machine learning (ML) is gaining momentum as a feasible tool that can significantly accelarate the material discovery process. 1,2 There has recently been an increasing interaction between first principles physical theories, chiefly density functional theory (DFT), and various ML frameworks. The DFT computations have been applied to the generation of computational material data for vast material databases, such as AFLOW, 3 with ∼ 1.7 million structures, and MaterialsProject, 4 with ∼ 620, 000 a) Electronic mail: sherif.abbas@uts.edu.au b) Electronic mail: olexandr@olexandrisayev.com c) Electronic mail: D.Winkler@latrobe.edu.au d) Electronic mail: mike.ford@uts.edu.au structures. This data is then fed into the ML algorithms for the rapid prediction of material properties, such as the dielectric constant, 5 superconducting critical temperature, 6 thermal and mechanical properties, 7 electronic structure, 3,8 and hydrogen storage capacity. 9 In the past few years, there has been a rising interest in the LEGO-like creation of hybrid 2D heterostructures 10 for various applications in photovoltaics and photonics. [11][12][13][14][15] A recent data mining study reported the existence of 1,825 2D materials that could be exfoliated from known experimental inorganic compounds, and this set can make ∼1.7 million bilayer structures (with n 2D materials, the number of possible bilayers is n(n − 1)/2 + n = n(n + 1)/2). The number of possible trilayers, tetralayers, etc. increases very rapidly and quickly becomes an intractable problem for DFT to explore. Is there an efficient method to rapidly predict the properties of these structures without having to perform expensive first principles calculations? Here we answer this question by applying various machine learning (ML) models for the prediction of the structural and electronic properties of layered van der Waals (vdW) materials, specifically bilayer materials that are constructed by layering different 2D materials. The success of the ML approach using a small training set could potentially save a significant amount of time and computational cost, and ultimately experimental effort.
In 2D materials, the interlayer van der Waals (vdW) forces are essential to maintain the equilbrium structure. The key structural quantitiy that indicates the strength of the vdW force in these materials is the interlayer distance, and a related energetic quantity is the layer binding energy. For DFT to be able to accurately predict these two quantities, it be should be corrected by incorporating a vdW correlation potential to the DFT correlation potential. 17,18 To this end, various forms of the vdW correlation have been proposed and applied to 2D materials, many of which displayed impressive accuracy, such as the Tkatchenko-Scheffler (TS) method 19,20 and the SCAN+rVV10 method. 21 The DFT databases, such as the aforementioned AFLOW and MaterialsProject databases, include structures that have been calculated without considering the vdW interaction potential. For example, there are a number of calculations published on the online portals of these databases that have wrong relaxed c lattice parameter, such as MoS 2 -WS 2 in which c = 22.37Å, which is extremely large compared to the lattice parameter in other hybrid bilayers such as graphene-boron nitride, which is 3.3Å. 22 This is a result of the use of non-dispersive correlation potentials. Therefore, an essential prerequisite is the construction of a data set of hybrid 2D vdW materials using a dispersionsupported DFT method.
Hybrid 2D heterostructures have recently attracted attention as potential vertical p-n junctions, in which the charge carriers move in the direction perpendicular to the plane of the 2D layers. Several vertical p-n junctions have been reported recently. [11][12][13][14][15] The 2D materials that have been investigated for p and n doping include MoS 2 , MoSe 2 , WSe 2 . With the growing number of semiconducting vdW materials, the number of possible hybrid bilayers that achieve p-n band alignment is increasing, and the prediction of the band gap of these bilayers using ML models would greatly support the search for new atomically thin p-n junction materials for optoelectronics applications.
A critical problem, however, in the application of DFT to hybrid vdW structures is the problem of creating lattice-matching interfaces between noncommensurate 2D materials. The application of DFT for studying hybrid 2D materials necessitates that the supercell describing the interface between two materials (whether parallel or perpendicular to the plane of the 2D materials) must have commensurate supercells of the two materials. One approach that is commonly adopted to this problem is to search for supercells that minimise the strain in each of the incommensurate monolayers. This is not a trival problem and often necessitates quite large strains of the order of a few percent in order to keep the size of teh supercell reasonable . The use of ML largely bypasses  this problem since DFT calculation are only performed  on the isolated individual layers and a small subset of  bilayers to generate the training and test sets for the ML  phase. Here, we use DFT for the calculation of the hybrid bilayer vdW structures. Using this data, we apply four ML models for the prediction of the interlayer distance and the band gap. We find that the considered ML models yield predictions of the interlayer distance and the band gap with reasonable accuracy.

II. COMPUTATIONAL DETAILS
A. The first principles approach We have selected our 2D structures from the large collection of 2D materials in the "2D atlas". 23 We have used VASP 24 to calculate the atomic and electronic structures using the generalized gradient approximation based on the PBE parametrization, 25 and accounted for the vdW interaction by adding the Tkatchenko-Scheffler vdW correlation potential. 20 We applied a k-point space of 8×8×1 for unit cells, and 3 × 3 × 1 for supercells, and an energy cutoff of 400 eV. The energy minimization tolerance is 10 −6 eV, and the force tolerance is 10 −2 eV/Å. For the 267 bilayers in the data set, we calculate the interlayer distance d as the equilbrium distance separating the two layers (that is, the minimum distance between the two layers), and the band gap. Note that the application of VASP for the geometries provided in the "2D atlas" by Miro et al. 23 induces a small strain on the individual layers. However, the presence of small planar strain have insignificant effects d and the band gap. 26 For our training set, we perform DFT calculations for the bilayers assembled from 53 monolayer structures, as shown in Table I. In selecting these monolayers, we have focussed on structures that satisfy the following two criteria: (1) It possesses trigonal symmetry, and (2) do not suffer from lattice distortions arising from covalent interaction with the adjacent layers. For example, we remove the CdX and ZnX monolayers (X= S, Se, Te) because of the significant layer distortions they exhibit when stacked with other layered materials.

B. The bilayer data set
For two different 2D materials, their unit cells are mostly incommensurate (that is, they have different values for the a lattice parameter). Therefore, we seek to construct the bilayers such that they are approximately commensurate; that is, for each of the two monolayers, we increase the number of monolayer images for each monolayer until the difference in the lattice constant of

BoronNitride NbSe2
Silicene TaS2  CdO NbTe2 SiliconCarbide TaSe2  GaS  NiS2 1T-HfS2 TaTe2  GaSe  NiSe2 1T-HfSe2 TiS2 Graphene NiTe2 1T-HfTe2 TiSe2 HfS2 PdS2 ReTe2 TaCl2 each monolayer supercell is < 2%. 26 We wrote a Java program that automates the search of approximately commensurate supercells, where the number of images of either monolayer does not exceed 5. For simplicity, the twist angle adopted here (that is, the angle by which one monolayer lattice rotates with respect to the other) is 0. The number of bilayers we choose from this set is 267. These bilayers will be used for the prediction of the interlayer distance. For the prediction of the band gap, we have to take into consideration that 33 out of the 53 monolayers are metallic. This makes the band gap zero for a large number of bilayers, which yields a very skewed data set that is problematic for machine learning. Therefore, we exclude all bilayers that are formed from metallic monolayers. Out of the set of 267 bilayes, the number of bilayers formed from these 20 monolayers is 49.
The application of DFT methods with vdW correlation correction for layered materials has been demonstrated to yield accurate results for the interlayer distances and the binding energies. 17,27 We use the TS method which accurately predicts the values of interlayer distances compared with the available experimental values as well as the benchmark Random Phase Appriximation (RPA) method, 27 as shown in Figure 1. In this figure, we compare between the values of the c lattice parameter for the 11 2D materials in Ref. 27 , predicted using the TS vdW correlation, RPA and the experimental values (obtained from Ref. 17 ). These values are for pristine bulk materials, not hybrid materials, however. The choice of the right method for hybrid multilayered materials is still an open problem, and is currently the subject of an ongoing theoretical investigation.
To perform the DFT calculations for bilayers, one has to find the most optimal stacking configuration for the bilayer. With respect to stacking, there are two groups of For boron nitride, we also added the AA', which is similar to AA but in which an N on one layer is faced by a B on the other layer.
bilayers: those with commensurate lattices (and thus the simulation cell is constructed from the unit cells of the two monolayers) and those with incommensurate lattices. For the commensurate lattices, we obtain the equilibriums stacking by performing a geometry relaxation for three stacking configurations: AA, AB and AB'. These configurations are displayed in Figure 2 The structure with lowest energy is then taken as the equilibrium stacking configuration. For the incommensurate unit cells, such as the boron nitride-silicon carbide bilayer, which is formed from 5 × 5 boron nitride unit cells and 4 × 4 SiliconCarbide unit cells, sliding one monolayer over the other in such large bilayers does not significantly affect the binding energy, and therefore we do not search for equilibrium stacking configurations in incommensurate bilayers.
We display the number of bilayers in which each monolayer is a component in Figure 3. This figure shows which monolayers are over-or under-represented in the data set.
Here, the InSe monolayer is only present as the bilayer InSe-InSe, and MoTe 2 is the most connected (it exists in 27 bilayers).
A useful method for visualizing high dimensionality data, such as the present data sets, is to apply the tdistributed Stochastic Neighbor Embedding (t-SNE) 28 algorithm for generating a 2D plot that clusters the data into groups that are labeled by the values in the output vector. Figure 4 shows a two-dimensional t-SNE projection mapping of chemical space of 267 hybrid 2D materials. t-SNE is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two dimensions suitable for human observation. Due to the selection of large number elements as building blocks, we see that materials explore vast chemical space without forming several well-defined clusters. However, we identified three broad regions with specific character: sulphides, tellurides, and CNSi (graphene, carbides, nitrides, and silicenes). On average sulphide materials have a low interlayer distance of 2.1 − 2.7Å. Tellurides typically have the highest at 3.2−3.8Å. Two extreme materials PtS 2 -PtS 2 and PdTe 2 -WTe 2 are marked accordingly.

C. The descriptors vector
The key to implementing a successful machine learning model is the descriptors. Fragment-based descriptors have demonstrated superior performance in enhancing the accuracy of ML models for the case of molecules 29 and crystals. 30 In this work we adopt the approach of Isayev et al., 30 the Property-Labelled Materials Fragments (PLMF), modified for 2D materials and is composed of 1529 descriptors. In the PLMF approach, a crystal structure is represented as a graph, with vertices decorated according to the reference properties of the atoms they represent and nodes are connecting topological neighbors according to the Voronoi tessellation. The adjacency matrix of this graph determines the global topology for a given system, including interatomic bonds and contacts within a crystal. The final descriptor vector to the Machine Learning (ML) model is obtained by partitioning a full graph into smaller subgraphs, that we call fragments by the analogy with fragment-based descriptors in cheminformatics. Every fragment starts from a node (an atom and its properties) and captures a path in the graph through a collection of bonded atoms. See Ref. 30 for technical details.
After constructing the monolayer descriptors, the next critical problem is the representation of each bilayer using the monolayer data. That is, is it possible to use the monolayer descriptors to describe the bilayers? Given that the interaction between the monolayers in a bilayer is of a dispersive nature, it does not affect the structure of either of the constituting monolayers. Therefore, it is possible to use the monolayer descriptors to describe the monolayers in the bilayer.
Using the monolayer descriptors, another question is: how to construct the bilayer descriptor vector? An intuitive choice would be to create a descriptors vector which doubles the size of the PLMF vector, and composed of the descriptors for the two monolayers. The problem in this approach is that it is sensitive to the swapping of the bilayers; that is, the descriptor vector for the bilayer A-B, made from monolayers A and B, will be a different vector from that of the bilayer B-A, even though the two bilayers are physically identical.
We approach this problem by the two following approaches: Bilayer representation 1 (BR1): For each bilayer A-B, we create two data records: one record has A's descriptors followed by B's descriptors, and the other with B's descriptors followed by A's descriptors. This method of representing multiple representations of the same data item has been previously applied in the context of organic molecules. 31 Using this representation, for the prediction of d, the data set has a total of 482 = 267×2−52 records (52 records are subtracted, instead of 53, because we have removed the PdS 2 -PdS 2 bilayer from the data set due to the appearance of covalent interaction between the S atoms of the adjacent layers), while for the prediciton of the band gap it has 78 = 49 × 2 − 20 records. Bilayer representation 2 (BR2): Instead of creating a descriptors vector that is double the size of the monolayer descriptor vector, we add the values of the descriptors in both monolayers. This method is intrinsically invariant to changes in the order of the monolayers, and can, in principle, be applied to supercells with more than two layers. Using this representation, for the prediction of d, the data set has a total of 267 records, while for the prediciton of the band gap has 49 records. Owing to the small number of records in BR2 for the band gap prediction, we do not perform machine learning on this set.
With 1529 fields, it is crucial to drop the number of descriptors using a dimensionality reduction algorithm. For this purpose we apply the least absolute shrinkage and selection operator (LASSO) algorithm on the dataset. 32 Since LASSO is a supervised dimensionality reduction algorithm, it cannot be applied to the set of monolayer descriptors. Instead, we apply LASSO on the bilayer data set in the BR2 against the interlayer distance d. In order to obtain the optimal number of descriptors using LASSO, we vary the value of α in LASSO, then apply the LASSO fitting which yields a number of descriptors, and then use the descriptors to predict using the support vector machine (discussed below). We choose the value of α which yields the lowest R 2 value.

D. The machine learning models
The dataset of bilayers constitutes a highly nonlinear prediction problem. We construct four ML models to learn the functions d and the band gap, which we describe below. Feedforward Neural Network (NN) 33 : NNs simulate the operation of the human brain. They can capture the highly nonlinear associations between input and output variables and can adaptively learn highly complex relationships. In an NN, an input layer receives the numbers in the descriptor vector. This layer is connected to a number of hidden layers, each layer being composed of a number of neurons, and finally the last hidden layer feed to an output layer whose number of neurons is the size of the output vector. Each neuron operates on the data it receives from the preious layer using an activation function which mimics the activation process in biological neurons, and a weight coefficient at the link between each two neurons. Each neuron collects its input from all of the neurons in the previous layer, and so forth up to the output layer. That is, each neuron j receives the following quantity from neurons i: p j (t) = i o i (t)w ij , where w ij is the weight connecting neurons i and j, and o i is the output of neuron i.
These NNs work by using various learning algorithms that update the values w ij , the simplest being the backpropagation learning algorithm. In this algorithm, w ij are updated according to the formula where η is the learning rate, C is the loss function, ξ(t) is a stochastic term, and t is the propagation step. For the activation function associated with each neuron, there are several choices available. Of interest to the presnt work is the logistic sigmoid which is given by a logistic (z) = 1 1+e −z , where z is the quantity received by the neuron. We use the Keras 34 python platform to implement the NN model. The network we have used for BR1 represenation has 35 × 2 = 70 input nodes, 5 nodes in a single hidden layer, and one output node. In the BR2 representation, the hidden layer has 35 input nodes. The sigmoid activation function is used in the hidden layer, while the linear activation function is used for the input and output layers. The learning rate is 0.03. Support Vector Machine (SVM): 35,36 This is a supervised learning model which was first introduced as a classification model, 35 and then modified for regression problems. 36 The SVM classifier performs the classification of the data set from selected subsets of samples, called support vectors, in which the characteristic information on class distinction is compressed. In the linear support vector regression problem which we utilize in the present work, the aim is to find the linear function f (x) = wx + b that approximates the output vector y with weights vector w such that the primal function J(x) is minimized, which is given by given the cosntraints |y n − (βx + b)| < + η n and |(βx + b) − y n | < + η * n for some small and positive variables η n and eta * n , for all n, where n is denotes the data record, and N is the total number of records, η n and eta * n are known as slack variables which must be greater than zero. Both C and are input parameters to the model. Relevance Vector Machine (RVM): 37,38 This is a sparse version of the support vector machine which attempts to amend several of the shortcomings of SVM, 38 such as non-probabilistic predictions, low sparsity, and the presence of the two fitting parameters C and which require cross validation. The RVM was introduced to enhance the sparsity of SVM and introduce a probabilistic weighting of the model weights based on Bayes' rule, assuming a Gaussian distribution of weights.
Random Forest (RF): 39 This is an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mean prediction of the individual trees. The training in RFs is based on feature aggregating 40 method. Given a training set x with output y, bagging repeatedly (B times) selects a random sample from x and y with replacement of the training set and fits decision trees to these samples. Once the training is finished, the prediction function operates by averaging the predictions from all the individual regression trees. The number of trees and the maximum tree depth are input parameters to the model.
The objective of ML is to build accurate prediction models. The quality of the model is determined by the ability of the model to predict the outcomes for cases which the model never encountered before; that is, how accurately can the model infer the outcomes based on its learning. This can be measured by splitting the data set into two parts: the training set, on which ML algorithm will be performed to build the ML model, and the test set, which will be used to test the quality of the model. The accuracy of the prediction is judged by using one of several loss functions. Here we use the following loss functions to assess the accuracy of the training: where MSE is the mean square error, MARE is the mean absolute relative error (%), R 2 is the coefficient of determination, Y i are the original test set outcomes (in our case, the DFT-calculated interlayer distances for the bilayers),Ŷ i are the predicted test set outcomes, andȲ is the average of the original test set outcomes. The significance of R 2 is that it expresses the proportion of the variance in Y i that can be predicted from the descriptor vector, and is an important measure of the ML model quality. However, its values are dependent on the size of the data set, and therefore we adopt the three quantities together, R 2 , MARE and MSE, to gauge the accuracy. For the case of the band gap prediction, we do not use the MARE because some of the values obtained are zero. Thus, for each of the four models in this work, we train the model on 80% of the data set and use the remaining 20% as the test set. We construct the test set by applying the k-means clustering 41 to the data set (the set of bilayers in BR2), which yields 6 clusters by the Silhouettes analysis. 42 Then, we randomly choose 20% of each cluster and build our test set. In the NN model, the learning iteration stops when the MSE is below 0.03 (for BR1) and 0.05 (for BR2). Then we compare the accuracy of the models based on the R 2 , the MSE and MARE values.
Each of the four models involved in this work requires the tuning of a number of parameters to ensure optimal performance. We tune the NN parameters manually, and we use the GridSearch algorithm provided by the Python scikit-learn library (GridSearchCV) to tune the parameters of the SVM and RF models. GridSearchCV calcualtes the best parameter combination by performing a cross-validated grid-search over a parameter grid. The parameters which we optimize for the SVM are: the C and γ, while those of the RF are: number of estimators and maximum depth.

A. Prediction of the interlayer distance
Using the LASSO algorithm, we find that the optimal number of descriptors for predicting d is 35 per bilayer. We summarize these descriptors in the Supporting Information.
The accuracy of the prediction of a ML model depends primarily on the quality of the descriptors vector. We find a clear demonstration of this in our comparison between the quality of predictions based on BR1 and BR2. We display the MSE, MARE and R 2 values for the fitting performed on each of the ML models in Tables II  and III for both bilayer representations, BR1 and BR2. Comparing the R 2 , MSE and MARE values of the test sets, it can be seen from Tables II and III that the SVM and RVM models exhibit a higher accuracy in BR1 than BR2, while the RF and NN models in BR2 have a higher accuracy than in BR1. Using the BR1 representation (Table II), the RF and SVM models in BR2 are both exhibiting the highest accuracy of the ML models tested in this work, with R 2 (RF) = 0.82, R 2 (RF) = 0.83, and MSE(RF) = 0.024, MSE(RF)= 0.023. Such difference in accuracy between BR1 and BR2 can be attributed to the fact that the addition of the describtor values in BR2 leads to the ambiguity with respect to the identity of the monolayer descriptors added. Figure 5 displays the correlation between the predicted and the DFT-calculated d values for the bilayers in the test set using the four ML models and using BR1. In order to understand the variation in the errors in the  various bilayers, we display in Figure 6, for each monolayer, the average and standard deviation of the MARE values obtained for the bilayers in which the monolayer is a component, comparing the predictions by the four ML models using BR1. An interesting feature in Figure 6 is that the presence of some monolayers in bilayers lead to high errors in the prediction of d, irrespective of the ML model used. It can be observed that CdO, TiTe 2 , are common to all four ML models in terms of having the largest prediction errors. ReS 2 , PdTe 2 , are common to RF, SVM and RVM models. In addition, in all four models, the accuracy of predicting the non-metallic monolayers graphene, boron nitride and silicon carbide is in the order, from highest to lowest error: silicon carbide, graphene, boron nitide. GaSe, TaS 2 , and boron nitride are common to the four models in terms of having lowest prediction errors.
The values displayed in Figure 3 might be a contributing factor in influencing the accuracy of predicting the bilayer properties (such as in the case of under-represented or over-represented monolayers). However, the trends of errors displayed in Figure 6 show that the accuracy of the models are somewhat independent from the counts in Figure 3.
Using the four ML models developed thus far, we run a prediction of the interlayer distances for all of the possible 1431 bilayers based on BR1, and display the results in Table IV. The NN predicts the minimum mean value of the four models, and the minimum d predicted by the RVM model is the lowest of the four minima (1.753Å). The SVM gives the most accurate prediction for presitine bilayers (1.981%), followed by RF (2.936%). Table V shows that, in the set of bilayers with lowest d, all models have bilayers with Pt and Pd atoms. We analyse the respresentation of the Pd and Pt atoms in the 5% percentile groups. The NN and SVM models have a lower representation of Pd than the RF and RVM models (that is, the number of Pd atoms in the NN and SVM predictions is 36 and 33 vs. 41 and 39 Pd atoms in the predictions by the RF and RVM, respectively), while all four models have almost the same represenation of Ptbased bilayers (24, 25, 25 and 23, in NN, RF, SVM and RVM, respectively). As for the 5 bilayers with the highest d, all models predict that WTe 2 -WTe 2 belongs to this group. In the predictions of the RF and RVM models, there are 3 bilayers that have the WTe 2 monolayer, and this is not the case in the other models. However, in the 95% percentile of bilayers, the number of WTe 2 monolayers is largest in the RF model (18 bilayers) while there are 10, 12 and 13 in the NN, SVM and RVM mod-  els. This monolayer is also absent in the 5% percentile (except that there is 1 bilayer in the 5% percentile of the RVM model).
As for the chalcogenide atom in TMDCs, the Te atom is the primary vdW repellant, because much more of the Te-based monolayers are in the 95% percentile (52, 54, 48, and 50 bilayers predicted by NN, RF, SVM and RVM, respectively) compared with those in the 5% percentile (8, 13, 11 and 14 bilayers predicted by NN, RF, SVM and RVM, respectively). On the contrary, the presence of the S atom acts as a vdW attractor (it exists in 65, 67, 66 and 68 bilayers predicted by NN, RF, SVM and RVM, respectively in the 5% percentile, compared with 0, 1, 2 and 1 bilayers predicted by NN, RF, SVM and RVM, respectively in the 95% percentile). The Se is almost boarder-line: it exists in 35, 26, 34 and 24 bilayers predicted by NN, RF, SVM and RVM, respectively in the 5% percentile, compared with 12, 14, 16 and 16 bilayers predicted by NN, RF, SVM and RVM, respectively in the 95% percentile.

B. Prediction of the band gap
For the band gap prediction, we applied the BR1 bilayer represenation. Using the LASSO algorithm, we obtained 11 significant descriptors per bilayer (which is much less than the 35 descriptors obtained per bilayer for the d prediction). Those descriptors are listed in the Supporting Information. However, before we discuss the prediction results of the ML models, we discuss the band gap values obtained using DFT.
Some functionals within DFT are known to underestimate the band gap by a considerable amount. 43 For example, it predicts a bandgap of ∼ 4.5 eV for hexagonal boron nitride, while its experimental band gap is ∼ 6 eV. 44 To overcome this problem, hybrid functionals that mix the DFT exchange with an exact exchange component have been demonstrating impressive agreement with experimental band gaps. 45 However, the present implementations of hybrid functionals are very costly to run. In the present work, we restrict ourselves to the application DFT for the prediction of the band gap, showing that the success of ML models to predict such quantity using DFT should indicate a similar success in the prediction of band gaps calculated using more accurate functionals, such as hybrid functionals.
The DFT band gap value for the boron nitride bilayer is 3.935 eV for the AB' stacking configuration, which is close to the value of 4.01 eV obtained for same stacking (though using a different vdW method) reported in a recent work. 46 The equilibrium stacking reported in Ref. 46 is the AA' stacking, whose band gap is 4.341 eV. However, using the TS vdW method, the lowest energy stacking configuration is AB', and is only 3 meV lower in energy than AA'. To ensure consistency, we adopt the minimum energy stacking of the TS method.
The band gap of each of the bilayers corresponds to a specific band gap alignment of the two constituting monolayers. While the alignments yield a band gap that is smaller than that of the monolayers, as is typically the case in semiconducting interfaces, the following 8 bilayers have a zero band gap: HfS 2 -MoTe 2 , HfS 2 -WTe 2 , MoSe 2 -TiS 2 , MoTe 2 -1T-HfS 2 , 1T-HfS 2 -WTe 2 , TiS 2 -WSe 2 , TiS 2 -ZnO, and TiSe 2 -WTe 2 . These bilayers are interesting because they exhibit a special kind of type III band alignment, 47 where two semiconducting interfaces form a metallic structure across the vdW vacuum. We will explore these structures in detail in a future contribution. We display the band alignment for two of them, namely HfS 2 -MoTe 2 and MoSe 2 -TiS 2 , in Figure 7(a,b). Here, the conduction band minimum (CBM) of one layer (HfS 2 in Figure 7(a) and TiS 2 in Figure 7 Figure 7(b)) are aligned, leading to a zero band gap in an interface between two semiconductors.
Next, we discuss the ML predictions of the band gap, displayed in Table VI and Figure 8. Even with such a small data set, the values of R 2 are quite reasonable for all four ML models, and the MSE values for the test sets are close to those in Tables II and III. The RF and SVM models have the highest prediction accuracy, based on  the R 2 = 0.9 value and the MSE = 0.04 value for each of the two models. However, the fact that R 2 = 1.0 for the training set prediction of the SVM alerts us that this model is over-fitted, and might not perform as well for bilayers outside of the present data set. Figure 8 displays the correlation between the values of the band gap obtained by DFT and those obtained by each of the four ML models. The SVM and RVM models have the highest accuracy in predicting the zero band gaps, as shown in Figure 8(c) and (d) (there are 3 such band gaps in the test set of 14 records), while the RF model is the least accurate in this respect. However, SVM and RVM greatly underestaimate the band gap of boron nitride-MoS 2 , which is 1.867 eV by 0.6 eV and 0.7 eV, respectively. RF and NN, on the contrary, predict it within an error of 0.3 eV and 0.2 eV, respectively.
We analyze the prediction of the band gap for all of the possible 20 × 21/2 = 210 bilayers, based on the BR1 representation, in Table VII. It can be seen that NN, RVM and SVM all predict a negative value for the band gap as the minimum in the set, and this is unphysical. This minimum value is only −0.15 eV, however, and is most  likely because the bilayer is metallic. This is analogous to the test and training sets here the ML models predict bandgaps of 0.1 eV where DFT returns a metallic result. The maximum band gap, which is that of the boron nitride bilayer, is correctly predicted by the NN, RVM and SVM models but not the RF model.
The distribution of the prediction data varies greatly with the ML model, as can be seen in the mean and percentile values. This is unlike the situation in Table  IV, where those values were very close across the models. Such volatility can be seen in the values of the standard deviation in Table VII compared to Table IV. This can be attributed to the small size of the data set that was used to train the models for band gap prediction.

IV. CONCLUSION
In the present work we have demonstrated how machine learning approaches can be effectively used to predict structural and electronic properties of van der Waals heterostructures. More specifically, we have used DFT calculations to 53 monolayers and a small subset, 267, of possible bilayers to train machine learning approaches to predict the interlayer spacing and bandgap for bilayers built from all possible 1431 combinations of the monolayers.
The use of property labelled materials fragments 30 as descriptors for the monolayers proves to be very effective, yielding prediction R 2 of 0.83 and 0.90 for the interlayer spacing and bandgap respectively.
Hybrid materials built from 2D monolayers are gaining attention as novel materials with tunable properties. The current bottle-neck to exploring the possibilitie these materials have is the vast parameter space of possible combinations. While electronic structure calculations are required to answer this question, even the most efficient available methods such as DFT are too time-consuming. The current results show great promise for a combined DFT/machine learning approach to solve this problem.