DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions

We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR), a Python tool for identifying high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, and clustering. We introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multi-objective Pareto front plots and assign a weighted score to each composition, where this score encompasses the trade-off between performance and density-based chemical uniqueness. We explore an additional uniqueness proxy related to property gradients in chemical space. We demonstrate that DiSCoVeR can successfully screen materials for both performance and uniqueness in order to extrapolate to new chemical spaces.

Many of the algorithms used for materials discovery in the literature are Euclidean-based Bayesian optimization schemes which seek a trade-off between high-performance and high-uncertainty regions [4,9,11,29,34,46-51], thereby favoring robust models and discovery of better candidates, but not explicitly favoring discovery of novel compounds.
Kim et al. [52] introduced two metrics for materials discovery: predicted fraction of improved candidates and cumulative maximum likelihood of improvement. These metrics are geared toward identifying "discovery-rich" and "discovery-poor" design spaces in the context of high performance rather than chemical distinctiveness.
In this work, we introduce the Descending from Stochastic Clustering Variance Regression (DiSCoVeR) algorithm which, unlike previous methods, screens candidates that have a high probability of successful synthesis while enforcing, through the use of a novel loss function, that the candidates exist beyond typical materials landscapes and have high performance. In other words, DiSCoVeR acts as a multi-objective screen in which the promise of a compound depends on both having desirable target properties and existing in sparsely populated regions of the cluster to which it is assigned. This approach then favors the discovery of novel, high-performing chemical families.

Methods
DiSCoVeR depends on clusters exhibiting homogeneity with respect to chemical classes, which we enforce via a recently introduced distance metric: the Element Mover's Distance (ElMD) [53]. Dimensionality reduction algorithms such as Uniform Manifold Approximation and Projection (UMAP) [54] or t-distributed stochastic neighbor embedding [55] can then be used to create low-dimensional embeddings suitable for clustering algorithms such as Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*) [56] or k-means clustering [57].
Finally, these can be fed into density estimation algorithms such as Density-preserving Uniform Manifold Approximation and Projection (DensMAP) [58], a UMAP variant, or kernel density estimation [59,60], where density is then used as a proxy for chemical uniqueness.
Additionally, we describe our data and validation methods. By combining a materials suggestion algorithm and DiSCoVeR, it is possible to assess the likelihood of a new material existing relative to known materials.
The workflow for creating chemically homogeneous clusters is shown in Figure 1.

Chemically Homogeneous Clusters
How are chemically homogeneous clusters achieved? The key is in the dissimilarity metric used to compute distances between compounds. Recently, ElMD [53] was developed based on the Earth Mover's (Wasserstein) distance; ElMD calculates distances between compounds in a way that more closely matches chemical intuition. For example, compounds with similar composition templates (e.g., XY2 as in SiO2, AlO2) and compounds with similar elements are closer in ElMD space. In other words, clusters derived from this distance metric are more likely to exhibit in-cluster homogeneity with respect to material class, which in turn allows in-cluster density estimation to be used as a proxy for novelty.
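The construction can be illustrated with a minimal sketch: represent each composition as a distribution of fractional amounts over positions on a one-dimensional elemental scale, then take the Wasserstein distance between two such distributions. The atomic-number scale and the helper `elmd_like` below are illustrative stand-ins only; ElMD itself uses a modified Pettifor scale and an optimal-transport formulation over it.

```python
from scipy.stats import wasserstein_distance

# Crude stand-in for the elemental scale: atomic number.
# (ElMD uses the modified Pettifor scale instead.)
Z = {"O": 8, "Na": 11, "Si": 14, "Cl": 17, "Ti": 22, "I": 53, "Cs": 55}

def elmd_like(comp_a, comp_b):
    """Wasserstein distance between two compositions expressed as
    {element: fractional amount} dictionaries."""
    ua, wa = zip(*[(Z[el], frac) for el, frac in comp_a.items()])
    ub, wb = zip(*[(Z[el], frac) for el, frac in comp_b.items()])
    # Weights are normalized internally, so fractional amounts work directly.
    return wasserstein_distance(ua, ub, u_weights=wa, v_weights=wb)

sio2 = {"Si": 1 / 3, "O": 2 / 3}
tio2 = {"Ti": 1 / 3, "O": 2 / 3}
csi = {"Cs": 1 / 2, "I": 1 / 2}
```

Even with this crude scale, the XY2 oxides SiO2 and TiO2 come out closer to each other than SiO2 is to CsI, matching the chemical intuition described above.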
In this work, we use UMAP for dimensionality reduction and HDBSCAN* for clustering, similar to the work by Hargreaves et al. [53], which successfully reported clusters of compounds that match chemical intuition.

Density-preserving Uniform Manifold Approximation And Projection
A multivariate normal probability density function is assigned to each datapoint embedded in DensMAP space (Eq. (1)):

f(X) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(X - \mu)^\top \Sigma^{-1} (X - \mu)\right)    (1)

where X, \mu, and \Sigma represent the DensMAP embedding position at which the density is evaluated, the train or validation DensMAP embedding position, and the covariance matrix, respectively. The covariance matrix used in this work is given by Eq. (2):

\Sigma = \begin{pmatrix} r & 0 \\ 0 & r \end{pmatrix}    (2)

where r represents the extracted DensMAP radius. By evaluating the sum of the densities contributed by all of the training points at each of the validation locations (Eq. (3)):

\rho_j = \sum_{i=1}^{n_\mathrm{train}} f\left(X_{v,j}; \mu_{t,i}, \Sigma_{t,i}\right)    (3)

where X_{v,j}, \mu_{t,i}, \Sigma_{t,i}, and n_\mathrm{train} represent the j-th validation DensMAP embedding position at which the density is evaluated, the i-th train DensMAP embedding position, the i-th train covariance matrix, and the total number of train points, respectively, we obtain a proxy for chemical uniqueness relative to existing materials. By combining high-fidelity CrabNet predictions of bulk modulus with DensMAP validation densities, we extract a list of promising compounds at the Pareto front: the line or "front" at which the trade-off between performance and chemical uniqueness is optimal.
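Assuming an isotropic covariance \Sigma = rI with a single shared radius r (the implementation extracts per-point radii from DensMAP), the summed validation densities of Eqs. (1)-(3) can be sketched in a few lines of numpy:

```python
import numpy as np

def validation_densities(train_pts, val_pts, r=0.5):
    """Sum the multivariate-normal densities contributed by every training
    point (Eqs. 1-3), evaluated at each validation embedding position.

    Assumption: a single isotropic covariance Sigma = r * I for all
    training points, rather than per-point extracted radii."""
    train_pts = np.asarray(train_pts, dtype=float)
    val_pts = np.asarray(val_pts, dtype=float)
    # Pairwise squared distances between validation and training points.
    d2 = ((val_pts[:, None, :] - train_pts[None, :, :]) ** 2).sum(axis=-1)
    # 2-D Gaussian with covariance r*I: normalization constant 1/(2*pi*r).
    dens = np.exp(-0.5 * d2 / r) / (2 * np.pi * r)
    return dens.sum(axis=1)  # one summed density per validation point
```

Validation points embedded far from the training cloud receive a low summed density and are therefore flagged as chemically unique.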
Additionally, by performing leave-one-cluster-out cross-validation, we accurately sort the list of validation clusters by their average performance with a scaled sorting error of approximately 1 %. This proof-of-concept strongly suggests that DiSCoVeR will successfully identify the most promising compounds when supplied with a set of realistic chemical formulae that partly contains out-of-class formulae produced via a suggestion algorithm. To our knowledge, this is a novel approach that has never been used to encourage new materials discovery as opposed to incremental discoveries within known families.

k-Nearest Neighbor Average
An average of the bulk moduli of the k-nearest neighbors (kNN) is computed as a poor man's gradient, as one type of proxy for chemical uniqueness. In this work, we use k = 10 to define the local neighborhood of influence, where kNNs are determined via the ElMD. Compounds which exhibit high predicted target bulk moduli relative to their kNNs are considered unique in terms of property gradient, despite having similar chemical structure.
Because it is based on nearest neighbors rather than a defined radius, compounds which are in relatively sparse UMAP areas may have neighbors from a chemically distant cluster.In this case, if all kNNs come from the same cluster, and this cluster exhibits similar properties, this can skew the measure to some extent.This artifact can be avoided by instead using a defined radius and a variable number of kNNs while ignoring compounds which have no kNN within the specified radius.
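The kNN average can be sketched from a precomputed (e.g., ElMD) pairwise distance matrix; the function name and the self-exclusion convention below are our own, not the package API:

```python
import numpy as np

def knn_property_average(dist, props, k=10):
    """For each compound, average the property over its k nearest
    neighbors (excluding the compound itself), given a precomputed
    pairwise distance matrix such as one built from ElMD distances."""
    dist = np.asarray(dist, dtype=float)
    props = np.asarray(props, dtype=float)
    # argsort each row; column 0 is the point itself (distance 0),
    # so take columns 1..k as the neighbors.
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    return props[nn].mean(axis=1)
```

Compounds whose own predicted property is far above their kNN average are the "peak"-like candidates described above. A radius-based variant would instead select `dist[i] <= radius` per row and skip rows with no neighbor inside the radius.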

Cluster Properties
Cluster validation fraction is given by Eq. (4):

f_k = \frac{n_{\mathrm{val},k}}{n_{\mathrm{val},k} + n_{\mathrm{train},k}}    (4)

where f_k, n_{\mathrm{val},k}, and n_{\mathrm{train},k} represent the validation fraction of the k-th cluster, the number of validation points in the k-th cluster, and the number of training points in the k-th cluster, respectively. This indicates to what extent a given cluster consists of unknown compounds and can be useful in identifying clusters which are chemically distinct from existing compounds. Cluster target mean is given by Eq. (5):

E_{\mathrm{avg},k} = \frac{1}{n_k} \sum_{i=1}^{n_k} E_{k,i}    (5)

where n_k, E_{\mathrm{avg},k}, and E_{k,i} represent the number of points in the k-th cluster, the mean bulk modulus of the k-th cluster, and the bulk modulus of the i-th point in the k-th cluster, respectively. This is useful for identifying clusters that exhibit overall high performance.
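Both cluster properties reduce to simple bookkeeping over integer cluster labels; the helper below is a hypothetical sketch of Eqs. (4) and (5), not the package API:

```python
import numpy as np

def cluster_properties(labels, is_val, target):
    """Per-cluster validation fraction (Eq. 4) and target mean (Eq. 5)
    from HDBSCAN*-style integer cluster labels."""
    labels = np.asarray(labels)
    is_val = np.asarray(is_val, dtype=bool)
    target = np.asarray(target, dtype=float)
    out = {}
    for k in np.unique(labels):
        mask = labels == k
        n_val = int((mask & is_val).sum())
        out[int(k)] = {
            "val_fraction": n_val / mask.sum(),         # Eq. (4)
            "target_mean": float(target[mask].mean()),  # Eq. (5)
        }
    return out
```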

Data and Validation
As a proof of concept, we use 10 710 unique chemical formulae and associated bulk moduli from Materials Project [62,63] to test whether DiSCoVeR can find new classes of materials with high performance.The highest bulk modulus was chosen when considering identical formulae.
We split the data into training, validation, and test sets in accordance with materials informatics best practices [64], using a 0.7/0.2/0.1 train/val/test split as well as leave-one-cluster-out cross-validation (LOCO-CV). We report two types of validation tests, as summarized in Table 1. One of the validation methods uses a weighted root-mean-square error (RMSE) of various multi-objective Pareto front properties (target vs. chemical uniqueness proxy). The target is weighted at twice that of the proxy property to favor consideration of high-performing candidates (Eq. (6)):

\mathrm{RMSE}_w = \sqrt{\frac{1}{n_{\mathrm{val}}} \sum_{i=1}^{n_{\mathrm{val}}} \frac{w_E \left(E_{\mathrm{true},i} - E_{\mathrm{pred},i}\right)^2 + w_p \left(p_{\mathrm{true},i} - p_{\mathrm{pred},i}\right)^2}{w_E + w_p}}    (6)

where w_E, w_p, n_{\mathrm{val}}, E_{\mathrm{true},i}, E_{\mathrm{pred},i}, p_{\mathrm{true},i}, and p_{\mathrm{pred},i} represent the bulk modulus weight, proxy weight, number of validation points, DFT-calculated bulk modulus of the i-th validation point, predicted bulk modulus of the i-th validation point, true proxy property of the i-th validation point, and predicted proxy property of the i-th validation point, respectively.
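One plausible reading of the weighted RMSE of Eq. (6), with the target weighted twice the proxy, is the following sketch (the exact weighting form in the implementation may differ):

```python
import numpy as np

def weighted_rmse(e_true, e_pred, p_true, p_pred, w_e=2.0, w_p=1.0):
    """Weighted RMSE over target (bulk modulus) and proxy errors, with
    the target weighted twice the proxy by default (cf. Eq. 6)."""
    e_true, e_pred = np.asarray(e_true, float), np.asarray(e_pred, float)
    p_true, p_pred = np.asarray(p_true, float), np.asarray(p_pred, float)
    # Weighted average of the two squared-error terms per point.
    sq = w_e * (e_true - e_pred) ** 2 + w_p * (p_true - p_pred) ** 2
    return float(np.sqrt(np.mean(sq / (w_e + w_p))))
```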
In the current implementation, however, the chemical uniqueness proxy is determined a priori and simultaneously using the full dataset; thus, the error contribution from the chemical uniqueness proxy is zero. This approach is reasonable for small- to medium-sized datasets (e.g., <20 000 compounds), but can quickly become intractable for large datasets due to memory constraints. We plan to modify DiSCoVeR to be compatible with large datasets in future work by utilizing the ElMD metric directly within DensMAP rather than computing a pairwise distance matrix in advance.
Likewise, the score for each compound is a weighted sum of the scaled target and proxy properties (Eq. (7)):

s_i = w_E \tilde{E}_i + w_p \tilde{p}_i    (7)

where w_E, w_p, \tilde{E}_i, and \tilde{p}_i represent the bulk modulus weight, the proxy weight, the scaled predicted bulk modulus of the i-th validation point, and the scaled predicted uniqueness proxy of the i-th validation point, respectively.
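A sketch of the weighted score of Eq. (7); the min-max scaling of each property to [0, 1] is our assumption about what "scaled" means here:

```python
import numpy as np

def discovery_scores(e_pred, proxy, w_e=2.0, w_p=1.0):
    """Weighted discovery score (cf. Eq. 7) over scaled target and
    proxy arrays. Min-max scaling to [0, 1] is an assumption; the
    implementation may scale differently."""
    def scale(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())
    return w_e * scale(e_pred) + w_p * scale(proxy)
```

Candidates that are simultaneously high-performing and low-density then receive the highest scores.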
The other validation method is a LOCO-CV approach using the cumulative density function (CDF) distance (i.e., Earth Mover's or Wasserstein distance) as a metric to determine the sorted similarity of a predicted cluster property vs. a true cluster property, using scipy.stats.wasserstein_distance() [65], where avg_true and avg_pred represent the 1D array of DFT-calculated average bulk moduli for each cluster and the 1D array of predicted average bulk moduli for each cluster, respectively, given by Eq. (5). The use of a cumulative sum causes the positions of high cluster bulk modulus averages to be further spaced apart and therefore more costly to "move earth" between the two distributions. In other words, inaccuracies associated with high-performing clusters are weighted more heavily than inaccuracies for low-performing clusters. This weighted error is then scaled by dividing by a "dummy" error, where v_weights is replaced by the average bulk modulus of the training data for each of the training splits (as opposed to the predictions on the validation data).
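The metric can be sketched with scipy by passing the cluster averages as distribution weights over sorted cluster positions; the exact call pattern below is our reading of the description, not the verbatim implementation:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def scaled_sorting_error(avg_true, avg_pred, avg_train_dummy):
    """LOCO-CV sorting error: Wasserstein distance between cluster-average
    distributions, scaled by a 'dummy' error in which the training-data
    average stands in for the predictions (via v_weights)."""
    pos = np.arange(len(avg_true), dtype=float)  # sorted cluster positions
    err = wasserstein_distance(pos, pos, u_weights=avg_true, v_weights=avg_pred)
    dummy = wasserstein_distance(pos, pos, u_weights=avg_true, v_weights=avg_train_dummy)
    return err / dummy
```

Perfect predictions give a scaled error of zero, while predictions no better than the training-mean dummy give a scaled error near one.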
We use CrabNet [44] as the regression model for bulk modulus, which depends only on composition to generate machine learning features; however, one of the other models mentioned in Section 1 could have been used instead.

Results and Discussion
We present characteristics of the DensMAP embedding and clustering scheme (Section 3.1), followed by compound-wise (Section 3.2) and cluster-wise (Section 3.3) Pareto front results. Finally, we discuss results of the LOCO-CV scheme.

Density-preserving Uniform Manifold Approximation And Projection Characteristics
We present a DensMAP clustering of ElMD distances between all pairs of compounds (Figure 2a) and plot the cluster count histogram (Figure 2b). We then sum densities at equally spaced locations across DensMAP space (Figure 3a) and color the points according to bulk modulus values (Figure 3b).
We obtain a total of 24 clusters, plus a non-cluster of unclassified points comprising a small percentage of the data. The number of clusters gives an estimate of the number of distinct chemical classes present in the dataset and is also affected by DensMAP and HDBSCAN* model parameters such as local density regularization strength (dens_lambda) and minimum cluster size (min_cluster_size). The unclassified points are typically isolated points in DensMAP space. In other words, unclassified points will likely exhibit high chemical contrast relative to other compositions via a low density proxy. We discuss this further in Section 3.2.

Compound Pareto Fronts
We present compound-wise Pareto fronts, a common technique used in multi-objective optimization, with predicted bulk modulus as the ordinate and one of two compound-wise proxies as the abscissa: train contribution to validation log density (Figure 4a) and k-nearest neighbor average (Figure 4b), as described in Section 2.2.
On the other hand, the k-nearest neighbor average acts as a poor man's gradient: in conjunction with target predictions, it emphasizes compounds which have a much higher predicted bulk modulus than that of their neighbors. In addition to the Pareto front, a parity line is also plotted. Compounds which are far above the parity line are high-performing relative to the surrounding neighborhood.

[Table 1 footnotes: * This density is the sum of the training densities evaluated at the validation locations in the embedded DensMAP space. For the k-neighbors data, the average of the 10 nearest neighbor properties was used as a proxy. † Cluster validation fraction refers to the ratio of the number of validation points within a cluster (as opposed to training points) to the total number of points in the cluster. DensMAP densities and cluster fractions are determined simultaneously for both validation and training sets during the DensMAP embedding, resulting in computational throughput restrictions; in other words, "predicted" and "true" are identical due to the implementation of DiSCoVeR at the time of writing. We plan to address this in future work.]

In terms of discovering materials which are chemically distinct from existing materials, train contribution to validation log density is the preferred proxy. We note that each of the proxies produces distinct plots. In the case of Figure 4a, clusters tend to be stratified horizontally, whereas in Figure 4b, cluster shapes exhibit similar orientations. As expected (Section 3.1), unclassified points appear frequently at or near the first Pareto front, owing to the fact that unclassified points are likely to have a lower density proxy and therefore a higher score. By contrast, unclassified points appear infrequently at or near the latter Pareto front. Additionally, the unique list of clusters present at the Pareto front differs for each plot. In other words, these are two types of chemical uniqueness: the first emphasizing chemical "distance" from other compounds and the latter emphasizing performance superiority over chemically similar compounds. We believe that either may be successfully used in the domain of materials discovery.
Compounds were assigned scaled discovery scores as described in Section 2.3 for each of the chemical uniqueness proxies.The top-10 ranked candidates for the density and peak proxies are given in Tables 2 and 3, respectively.An outer merge of these two lists is given in Table 4.
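The outer merge itself is a one-liner in pandas; the formulae and scores below are placeholders for illustration, not results from this work:

```python
import pandas as pd

# Hypothetical top-ranked lists; the formulae and scores are placeholders.
density_top = pd.DataFrame({"formula": ["AB2", "CD", "EF3"],
                            "s_rho": [0.95, 0.90, 0.88]})
peak_top = pd.DataFrame({"formula": ["CD", "AB2", "GH"],
                         "s_knn": [0.93, 0.91, 0.87]})

# An outer merge keeps every formula appearing in either list (cf. Table 4);
# a score missing from one list appears as NaN in the merged table.
merged = pd.merge(density_top, peak_top, on="formula", how="outer")
```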
It is interesting that while the ranking order and scores differ between the two lists, 10 of the compounds are shared, with only 3 uncommon between the two lists. In other words, both proxies identify similar compounds but with differing priority. This suggests that low-density regions of the space also exhibit low average bulk modulus of the surrounding neighborhood (which necessarily is a larger region). It is likely that replacing the peak proxy (i.e., the kNN average proxy) with a radius-based neighborhood average would yield a very different set of rankings. We expect that by a proper choice of radius, many of the previously

Cluster Pareto Front
We also present Pareto fronts for cluster-wise properties. For the ordinate, we use the predicted cluster average bulk modulus (Figure 5). For the abscissa, we use cluster validation fraction as a proxy for the chemical distinctiveness of a cluster. In this example, the data is clustered tightly in the abscissa because the train/val split was applied randomly without regard to cluster. In a more realistic scenario with much more validation data than training data, where the validation set encompasses previously unexplored chemical spaces, there is likely to be a larger spread. Indeed, such a use-case is the intention for this visualization tool. There is a much wider spread in the ordinate, indicating an interesting feature of the clustering results: compositions which are chemically similar to each other also tend to have, on average, similar bulk moduli. This is unsurprising, especially since the regression model used is based purely on composition.
In future work, it may be interesting to replace average bulk modulus with best-in-cluster bulk modulus to explore a different type of high-ranking clusters.

Leave-one-cluster-out Cross-validation
Finally, we perform LOCO-CV to evaluate the utility of the DiSCoVeR method in identifying clusters with high average cluster bulk modulus. We accurately sort the list of validation clusters by their average performance with a weighted, scaled sorting error (Section 2.3) of approximately 1 %. In other words, the out-of-cluster regression with a larger weight placed on higher performance is very accurate. This suggests that CrabNet can successfully extrapolate performance predictions to new chemical spaces in accordance with the goal of DiSCoVeR. In future work, we plan to also test the out-of-cluster extrapolation performance of the chemical uniqueness proxies (Section 2.3).

Conclusion
We embedded ElMD distances in DensMAP space and clustered via HDBSCAN* to identify chemically similar clusters for 10 710 compositions. We introduced new proxies (i.e., metrics) for uniqueness-based materials discovery in the form of train contribution to validation log density, k-neighbor averages, and cluster validation fraction. By pairing these with the CrabNet regression model, we visualize Pareto plots of predicted bulk modulus vs. uniqueness proxy and obtain weighted uniqueness/performance rankings for each of the compounds. This reveals a new way to perform materials discovery with a focus toward identifying new high-performing, chemically distinct compositions.

Figure 2: Summary of cluster properties. (a) DensMAP embeddings based on ElMD distances between compounds, colored by cluster. (b) Histogram of number of compounds vs. cluster ID, colored by cluster.

Figure 3: Density and bulk modulus. (a) DensMAP densities of both training and validation points summed at gridded locations in DensMAP space. (b) 10 710 bulk moduli of training and validation points embedded in DensMAP space.

Figure 4: Compound-wise Pareto plots. (a) Pareto plot of validation bulk modulus predictions (GPa) vs. train contribution to validation log density, colored by cluster. The Pareto front is plotted as a dashed line. (b) Pareto plot of training and validation bulk modulus predictions vs. kNN average bulk modulus (GPa), where k = 10. The Pareto front is given by a dashed line. A line of parity is given by a solid teal line to emphasize that compounds well above this line are considered unique.

Figure 5: Cluster-wise average bulk modulus predictions (GPa) vs. cluster-wise validation fraction. This emphasizes the trade-off between high-performing clusters and chemically unique clusters relative to the original data.

Table 1: Validation methods, splits, notion of best fit, and property used to calculate the notion of best fit.

Table 4: Outer merge of the top-10 ranked high-performing density-proxy and peak-proxy candidates: formula, density discovery score (s_ρ), and peak discovery score (s_kNN).