Bayesian optimization of nanoporous materials

Nanoporous materials (NPMs) could be used to store, capture, and sense many different gases. Given an adsorption task, we often wish to search a library of NPMs for the one with the optimal adsorption property. The high cost of NPM synthesis and gas adsorption measurements, whether these experiments are conducted in the lab or in a simulation, often precludes exhaustive search. We explain, demonstrate, and advocate Bayesian optimization (BO) to actively search for the optimal NPM in a library of NPMs, and to find it using the fewest experiments. The two ingredients of BO are a surrogate model and an acquisition function. The surrogate model is a probabilistic model reflecting our beliefs about the NPM structure-property relationship based on observations from past experiments. The acquisition function uses the surrogate model to score each NPM according to the utility of picking it for the next experiment. It balances two competing goals: (a) exploitation of our current approximation of the structure-property relationship to pick the highest-performing NPM, and (b) exploration of blind spots in the NPM space to pick an NPM we are uncertain about, to improve our approximation of the structure-property relationship. We demonstrate BO by searching an open database of ∼70,000 hypothetical covalent organic frameworks (COFs) for the COF with the highest simulated methane deliverable capacity. BO finds the optimal COF and acquires 30% of the top 100 highest-ranked COFs after evaluating only ∼120 COFs. Moreover, BO searches more efficiently than evolutionary and one-shot supervised machine learning approaches.


Introduction
The selective gas adsorption properties of nanoporous materials (NPMs) endow them with many possible applications in the storage [ , ], separation [ ], and sensing [ ] of gases. As examples, promising applications of NPMs include (i, storage) densifying hydrogen (H2), a clean fuel, for compact storage onboard vehicles [ , ], (ii, separation) capturing carbon dioxide from the flue gas of coal-fired power plants and subsequently sequestering it to prevent global warming [ ], and (iii, sensing) detecting toxic compounds and explosives [ , ]. The many topologies [ , , ], abundance of molecular building blocks, and post-synthetic modifiability [ , ] of NPMs permit a practically unlimited number of possible structures exhibiting diverse adsorption properties.
A common goal is to find, among a large set of candidate NPM structures, the NPM structure(s) with the optimal adsorption property for a given application. As opposed to an exhaustive search, our goal is to search for the optimal NPM efficiently, by consuming minimal resources (computational and/or physical) in the process. In the laboratory setting, synthesizing an NPM and measuring its property costs labor and raw materials, and throughput is limited by the capital equipment in the lab. In the computational setting, constructing a high-fidelity computational model of an NPM structure [ -] and predicting its gas adsorption property through molecular simulations [ , ] consumes electricity, and throughput is limited by computing resources. Thus, both in the bona fide laboratory and on the computer, our goal is to find the optimal NPM(s) for a given adsorption task using the fewest experiments (experiment = constructing an NPM and evaluating its adsorption property).
In this article, we explain, demonstrate, and advocate Bayesian optimization (BO) to actively and efficiently search for NPMs with an optimal property for a given adsorption task. Active, because each BO iteration performs three consecutive steps: (a) conducting an experiment on an NPM, (b) updating our belief about the structure-property relationship, and (c) selecting the NPM for the next experiment. See Fig. for a high-level illustration. Efficient, because BO makes a data-informed decision on the NPM to select for the next experiment, while balancing: (a) exploitation of our current (uncertain!) data-driven belief about the structure-property relationship to pick an NPM that might have the optimal property and (b) exploration of regions of the NPM design space where our belief about the structure-property relationship is weak to pick an NPM we are uncertain about.

Figure : Illustration of a single iteration of active search in Bayesian optimization (BO) of nanoporous materials. First, we conduct an experiment that measures the property f(x_n) of the material represented by x_n. Second, using the new observation, we update our belief about the underlying objective function f(x), reflected by the surrogate model. Third, we use the acquisition function to pick the next material for an experiment. BO is an active search method to find the material x that optimizes the black-box function f(x).
Monte Carlo tree search. When the NPM search space can be framed as a tree, Monte Carlo tree search [ ] is more efficient than random search. Starting at the root node, a policy to select a child node is iteratively applied, giving a path through the tree, culminating at a leaf node. The experiment is then conducted on the NPM represented by the leaf node. Its measured property, viewed as a reward, is back-propagated through the tree to update the statistics of each node along the path to it. Both the visit counts and reward allocations of the nodes are used in the tree policy to balance exploration of new branches of the tree that have not been visited often (or at all) and exploitation of current knowledge to follow branches of the tree that appear likely to lead to optimal NPMs. See Refs. [ , ] as examples.
Each of these prior approaches suffers from drawbacks. The supervised machine learning approach selects training data to learn a good predictor of the property from the structure representation, then uses the predictor to greedily acquire the COFs with the highest predicted property. This passive approach can be viewed as one round of exploration followed by one round of exploitation. Active learning [ ] can be used to reduce the number of training examples needed to learn the structure-property relationship, but it is not geared towards finding the optimal NPM using the fewest experiments. In BO, we will actively collect training examples according to the goal of finding the optimal NPM with the fewest experiments. Genetic algorithms are sequential, active search procedures aimed at quickly finding the optimal NPM. However, the genetic operations are heuristic and do not balance exploration and exploitation rigorously. As a consequence, genetic algorithms can be difficult to tune and could require many experiments to find the optimal NPMs. MCTS balances exploration and exploitation more rigorously. But it requires many NPM evaluations to identify promising regions of the tree because it does not explicitly leverage the similarity among structures for principled exploitation. Further, MCTS is limited to NPM design spaces that can be framed as a tree.

Problem setup: find the optimal material
Suppose we have a large database of candidate NPM structures, X , for some adsorption task. Let f : X → R be a black-box objective function that, given a candidate NPM x ∈ X , returns the relevant adsorption property y = f (x). Each evaluation of f corresponds to performing an expensive experiment-either in the laboratory or in a molecular simulation-to measure the adsorption property y of NPM x. Our goal is to find the highest-performing NPM x * from X that maximizes the objective function f , while conducting the fewest number of expensive experiments.
We can interpret f(x) as the [unknown] structure-property relationship [ , ], since x [abstractly, at this point in our discussion] represents the structure of a unique NPM and evaluating f means conducting an experiment to measure its property, y.

Overview of Bayesian Optimization
We explain the key ideas behind the Bayesian Optimization (BO) framework [ , ] to find the highest-performing NPM, i.e., to solve the problem in eqn. efficiently, using the fewest (expensive) experiments.
. Defining an NPM feature space

While x as an abstract representation of an NPM suffices for defining the problem in eqn. , for BO we must concretely define an NPM feature space, or search space, in which each NPM x lies.
Without loss of generality, take x to be a fixed-size (among all NPMs) vector representation of the NPM that lies in a continuous feature space. The NPM feature vectors {x} should be designed to (1) encode the relevant structural and chemical features of the NPMs, (2) be rotation-, translation-, and, if the NPM is a crystal, replication-invariant, (3) be cheap to compute compared to conducting the experiment (evaluating f), and (4) ideally, each correspond to a unique NPM (the mapping NPM → x is injective) [ ]. As a result, NPMs close in the feature space should exhibit close values of the property y.
The simplest example of a representation x of an NPM is a list of hand-designed (based on domain expertise) descriptors/features of its structure and composition, such as pore volume, surface area, largest included sphere diameter, density, weight fraction carbon, etc. [ -]. Alternatively, x could be learned from a graph representation of the NPM [ -]. See Refs. [ , , , -] for different examples of NPM feature spaces. As opposed to dwelling on how to define a good feature space of NPMs, we will instead focus on BO, a technique to search the defined feature space for the optimal NPM x* in an efficient manner.

. Active search: exploitation vs. exploration
Even with an NPM feature space defined, in practice, the structure-property relationship f (x) is a black-box function; analytical expressions for f (x) and/or its gradient ∇ x f (x) are not known, and it may be multi-modal.
BO is a derivative-free method to actively and efficiently search the database of NPMs X for the NPM x* that maximizes f(x). Active, because BO sequentially selects NPMs from X for experimentation (to evaluate with f), iterating between conducting an experiment and deciding which experiment to conduct next. Efficient, because BO makes a data-driven decision to select the next NPM for an experiment while taking into account all observed (NPM x, property y = f(x)) pairs from previous experiments. Each decision to select the next unevaluated NPM from X to evaluate with f must trade off two conflicting goals:

(1) Exploitation suggests using our current, but uncertain, approximation of the structure-property relationship, based on the past observations, to select the NPM that appears to have the most promise as a high-performing material.

(2) Exploration suggests selecting the NPM that we are most uncertain about, to improve our approximation of the structure-property relationship.

So, to balance exploitation and exploration, we must balance visits to regions of NPM feature space that (i) appear to contain high-performing NPMs and (ii) have not been explored well. A colloquial example of the exploitation-exploration dilemma in our lives is, in aiming to maximize our enjoyment of food, whether to choose a restaurant that we have visited and know we like versus a new one [ ].

The ingredients of BO for data-driven decision-making: a surrogate model and an acquisition function
In the BO framework, the two key components used to make each sequential decision of which NPM to conduct an experiment on next are (1) a surrogate model that captures our beliefs, based on past observations, about the structure-property relationship and (2) an acquisition function that scores each NPM according to the utility of conducting the experiment on it next. The acquisition function uses the surrogate model of the structure-property relationship f(x) to decide which NPM to evaluate next while striking a balance between exploitation and exploration.
The surrogate model. The surrogate model is a probabilistic model of the structure-property relationship that, for each NPM x, treats the property as a random variable with mean ŷ(x) ∈ R and variance σ²(x) ∈ R. The surrogate model reflects our current beliefs about f(x) and serves two purposes in BO. First, to guide exploitation, ŷ(x) cheaply estimates the properties of the remaining, unevaluated NPMs, i.e., ŷ(x) is a cheap-to-evaluate approximation of the expensive-to-evaluate objective function f(x). Second, to guide exploration, σ²(x) quantifies the uncertainty in the predicted properties of the unevaluated NPMs. This makes us aware of "blind spots" of the surrogate model: regions in the NPM feature space we need to explore to improve our approximation ŷ(x) and reduce the uncertainty in our beliefs about f(x).
The surrogate model is updated in each iteration of BO, after the new observation (x_{n+1}, y_{n+1} = f(x_{n+1})) is gathered, to (i) improve the approximation of f(x) and (ii) account for the reduced uncertainty in the region of the feature space surrounding the newly evaluated NPM x_{n+1}. Consequently, let us denote the surrogate model after iteration n of BO as f̂_n : x → (ŷ(x), σ(x)).
The acquisition function. The acquisition function A(x; f̂_n) : X → R scores the utility of next evaluating NPM x ∈ X with the expensive objective function f. Here, "utility" is defined in terms of our ultimate goal of finding the optimal NPM x* in eqn. with the fewest experiments. The acquisition function employs the predicted property ŷ(x) and the associated uncertainty σ²(x) from the surrogate model f̂_n to assign a utility score to each NPM that balances exploitation and exploration, respectively. Maxima of the acquisition function are located in regions of NPM feature space where the predicted property is large and/or the uncertainty is high.
Without loss of generality, the observations are assumed noise-less for clarity of presentation.
The decision of which NPM to evaluate next is made by maximizing the acquisition function over the unevaluated NPMs:

x_{n+1} = arg max_{x ∈ X \ X_n} A(x; f̂_n),

where X_n = {x_1, ..., x_n} is the acquired set of n NPMs that have already been evaluated. Importantly, the acquisition function must be cheap to evaluate and optimize (if X is not a finite set).
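Over a finite library, maximizing the acquisition function reduces to an argmax over the unacquired candidates. A minimal NumPy sketch (not from the original work; the function names and the toy distance-based acquisition score are illustrative assumptions):

```python
import numpy as np

def select_next(X_candidates, acquired_idx, acquisition):
    """Return the index of the unacquired candidate maximizing the acquisition."""
    scores = np.asarray(acquisition(X_candidates), dtype=float)
    scores[list(acquired_idx)] = -np.inf   # never repeat an experiment
    return int(np.argmax(scores))

# Toy usage: score candidates by (negative) distance to an assumed sweet spot.
X = np.random.default_rng(0).random((100, 2))
next_i = select_next(X, {0, 1, 2}, lambda Z: -np.linalg.norm(Z - 0.5, axis=1))
```

In practice, the `acquisition` callable would be built from the surrogate model's predicted mean and variance rather than a hand-coded score.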

Summarizing: BO active search iterations
Fig. illustrates an iteration of BO. At the beginning of iteration n, we conduct an experiment on NPM x_n, i.e., we evaluate NPM x_n with the objective function f to obtain a new observation (x_n, y_n = f(x_n)). Next, we update the old surrogate model f̂_{n-1}(x) to account for this new observation, giving the new surrogate model f̂_n(x). We then select the next (unevaluated) NPM to evaluate, x_{n+1}, as the one that maximizes the acquisition function A(x; f̂_n).
We terminate BO after we either (i) expend our budget for experiments or (ii) find a material with a satisfactory property value. The BO solution to the problem in eqn. , x*, then follows as the evaluated NPM with the highest observed property, i.e., the x_i maximizing y_i over i ∈ {1, ..., N}, where N is the number of BO iterations (= experiments) performed. Some theoretical work focuses on characterizing how, under specific assumptions, the quality of the approximate optimum in BO scales with the number of iterations [ ].
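The BO iteration described above can be sketched as a short loop over a finite candidate pool. This is a minimal illustration, not the authors' implementation: the nearest-neighbor "surrogate" and the UCB-style acquisition in the demo are stand-in assumptions used only to keep the example self-contained.

```python
import numpy as np

def bayes_opt(f, X_pool, fit_surrogate, acquisition, n_iter, seed=0):
    """Minimal BO loop over a finite pool of candidates (a sketch).

    f             : expensive objective, evaluated once per iteration.
    X_pool        : (m, d) array of candidate feature vectors.
    fit_surrogate : (X_obs, y_obs) -> (mean_fn, var_fn) over candidate arrays.
    acquisition   : (mean, var) -> utility score per candidate.
    """
    rng = np.random.default_rng(seed)
    acquired = [int(rng.integers(len(X_pool)))]       # random initial NPM
    y = [f(X_pool[acquired[0]])]
    for _ in range(n_iter - 1):
        mean_fn, var_fn = fit_surrogate(X_pool[acquired], np.array(y))
        scores = acquisition(mean_fn(X_pool), var_fn(X_pool))
        scores[acquired] = -np.inf                    # never re-evaluate
        nxt = int(np.argmax(scores))
        acquired.append(nxt)
        y.append(f(X_pool[nxt]))
    return acquired[int(np.argmax(y))], acquired, y   # incumbent, history

# Toy demonstration on a 1-D pool with a stand-in surrogate: predict the
# property of the nearest observed point; use that distance as "uncertainty".
X_pool = np.linspace(0.0, 1.0, 51).reshape(-1, 1)

def toy_f(x):                                         # stand-in for a simulation
    return float(-(x[0] - 0.7) ** 2)

def nn_surrogate(X_obs, y_obs):
    def mean(Z):
        return y_obs[np.argmin(np.abs(Z - X_obs.T), axis=1)]
    def var(Z):
        return np.min(np.abs(Z - X_obs.T), axis=1)
    return mean, var

best_i, acquired, y = bayes_opt(toy_f, X_pool, nn_surrogate,
                                lambda m, v: m + 2.0 * v, n_iter=20)
```

In a real BO run, `fit_surrogate` would be a Gaussian process and `f` an expensive experiment or simulation.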
N.b., typically, the surrogate model is retrained from scratch after each iteration of BO, but some surrogate models can be trained online [ ], reducing the computational cost of the search.

Remark: BO vs. active learning
We remark on a distinction between active learning [ ] and Bayesian optimization. Both sequentially collect training examples for a supervised machine learning model. In active learning, the examples are efficiently collected with the goal of reducing the uncertainty in the machine learning model. In Bayesian optimization, the examples are efficiently collected with the goal of, instead, finding the optimal material. BO is more efficient for finding the optimal material than active learning because it avoids collecting examples in regions of feature space that contain poor-performing materials, whereas active learning will do so to reduce the uncertainty of the model in those regions.

Surrogate models and acquisition functions
In this section, we explain surrogate models and acquisition functions commonly used in BO.

. Surrogate models: Gaussian processes
Gaussian processes (GPs) [ , ] are the most commonly used surrogate models in BO owing to their flexibility as function approximators, principled uncertainty quantification, and compatibility with the kernel trick. GPs are non-parametric models that can approximate complicated objective functions. Through a Bayesian probabilistic framework, GPs provide uncertainty estimates in their predictions and allow incorporation of prior beliefs. GPs rely on a kernel function k(x, x′) : X × X → R [ ] to capture the similarity between any two NPMs x and x′. This gives GPs the flexibility to approximate arbitrary, complicated (but well-behaved!) functions f(x). Moreover, it gives GPs versatility in how to represent the NPMs; e.g., graph kernels [ ] can be used for NPMs represented as crystal graphs (e.g., [ ]).

What is a GP?
A GP is a stochastic process that treats the value of the objective function at any given point in feature space, f(x), as a random variable. Specifically, GPs assume the joint distribution of any finite collection of function values, say at points {x_1, x_2, ..., x_m} on its domain, follows a multivariate Gaussian distribution with mean zero and covariance matrix Σ ∈ R^{m×m} given by the kernel function applied pairwise over the points, Σ_{i,j} = k(x_i, x_j). The kernel function k(x, x′) quantifies the similarity of NPMs x and x′; hence, the idea in GPs is that the properties f(x) and f(x′) of similar (dissimilar) NPMs x and x′ are highly correlated (uncorrelated). [ ] GPs effectively model the entire function f(x) in a pointwise manner, by assuming eqn. holds for any arbitrary, finite collection of function values on its domain. The mean of zero in eqn. assumes the measurements are centered.
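To make the joint-Gaussian assumption concrete, one can draw sample functions from a zero-mean GP prior by evaluating the kernel matrix on a grid of points and sampling from the corresponding multivariate Gaussian. A minimal NumPy sketch (the RBF kernel hyperparameter values and jitter are illustrative assumptions):

```python
import numpy as np

def rbf(XA, XB, sigma_f=1.0, gamma=0.2):
    """RBF kernel matrix: k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 gamma^2))."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * gamma**2))

# Evaluate the prior covariance on a grid of m points and draw three sample
# functions from the zero-mean multivariate Gaussian N(0, Sigma).
x_grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
Sigma = rbf(x_grid, x_grid) + 1e-8 * np.eye(len(x_grid))  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(x_grid)), Sigma, size=3)
```

Each row of `samples` is one plausible structure-property relationship under the prior; smaller length scales γ yield rougher sample functions.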
From a Bayesian perspective, eqn. is a prior assumption about the structure-property relationship f (x). When we gather new observations, we will update this prior assumption to arrive at the posterior distribution reflecting our beliefs about the structure-property relationships in light of new data.

Kernel functions.
Examples of kernel functions that operate on vector representations of two NPMs x and x′ include the linear, polynomial, and radial basis function (RBF) kernels:

k(x, x′) = σ_f² x · x′ (linear kernel),
k(x, x′) = σ_f² (1 + x · x′)^d (polynomial kernel),
k(x, x′) = σ_f² exp(−‖x − x′‖² / (2γ²)) (RBF kernel).

Each kernel possesses the hyperparameter σ_f, the signal variance, a scale factor controlling the expected range of the functions represented by the GP. The polynomial kernel has a hyperparameter d that controls the order of the polynomial in the features, and the RBF kernel contains a length-scale hyperparameter γ that controls how close x and x′ must be in the feature space to be considered "similar" and, thereby, the expected roughness of the functions represented by the GP. Implicitly, each nonlinear kernel maps the two vectors x and x′ into a new, higher-dimensional feature space through a mapping τ, then takes the inner product of the vectors in the new feature space: k(x, x′) = τ(x) · τ(x′).

GP inference. Suppose we have gathered n observations {(x_i, y_i = f(x_i))} from previous experiments (previous iterations of BO), with y_i the measured property of NPM x_i. Under the Bayesian view, y_i is a noise-free observation of the random variable f(x_i). We wish to know the distribution of the random variable f(x) for an unevaluated NPM x, to determine the utility of evaluating it in the next experiment. Imposing the assumption in eqn. for the collection of points composed of (i) the n evaluated NPMs from past experiments {x_1, x_2, ..., x_n} and (ii) the unevaluated NPM x, the vector [f(x), f] follows a zero-mean multivariate Gaussian with covariance matrix [[k(x, x), σᵀ], [σ, Σ]], with f = [f(x_1), ..., f(x_n)] the vector of random variables representing the properties of the previously evaluated NPMs, σ = [k(x, x_1), k(x, x_2), ..., k(x, x_n)] the vector of kernel values between the unevaluated NPM and the previously evaluated NPMs, and Σ the kernel matrix of the previously evaluated NPMs, with Σ_{i,j} = k(x_i, x_j) ∀ i, j ∈ {1, 2, ..., n}. However, we have observations of the random variables in f: y = [y_1, y_2, ..., y_n]. Conditioning the joint distribution in eqn.
on the observations y, we arrive at the posterior distribution for the property f(x) of the unevaluated NPM, also Gaussian, with mean and variance

ŷ(x) = σᵀ Σ⁻¹ y,    σ²(x) = k(x, x) − σᵀ Σ⁻¹ σ.

We can interpret these two equations as follows. The predicted property of the unevaluated NPM, ŷ(x), is a linear combination of the observed properties y of the evaluated NPMs, with weights σᵀ Σ⁻¹. The weight on each measured property depends on the similarity between that evaluated NPM and the unevaluated NPM.
The variance σ²(x), describing the uncertainty associated with the prediction of the property of the unevaluated NPM x, is the prior variance k(x, x) reduced by σᵀ Σ⁻¹ σ, which captures the similarity of the unevaluated NPM x to the set of previously evaluated NPMs.
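The posterior mean and variance formulas translate directly into a few lines of linear algebra. A hedged sketch of noise-free GP inference (the RBF kernel, its hyperparameter values, and the jitter term are illustrative assumptions, not choices from this work):

```python
import numpy as np

def rbf(XA, XB, sigma_f=1.0, gamma=0.2):
    """RBF kernel matrix between two sets of feature vectors."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * gamma**2))

def gp_posterior(X_obs, y_obs, X_new, kernel=rbf):
    """Noise-free GP posterior: mean = s^T Sigma^-1 y, var = k(x,x) - s^T Sigma^-1 s."""
    Sigma = kernel(X_obs, X_obs) + 1e-10 * np.eye(len(X_obs))  # tiny jitter
    S = kernel(X_new, X_obs)                 # cross-covariances, one row per new x
    mean = S @ np.linalg.solve(Sigma, y_obs)
    var = kernel(X_new, X_new).diagonal() - np.einsum(
        "ij,ji->i", S, np.linalg.solve(Sigma, S.T))
    return mean, var
```

At an observed point the posterior mean interpolates the measurement and the variance collapses toward zero; far from all observations the variance reverts to the prior value k(x, x).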
Fig. illustrates a GP model of a toy function f(x), based on an RBF kernel, over a toy one-dimensional NPM feature space X = R, trained on five observations. The mean of the GP, ŷ(x), approximates f(x), and the variance σ²(x) expresses uncertainty in the approximation. Generally, the uncertainty is small close to an observation and large far from an observation.
Note that we can pose GPs that relax the assumption that the observations are noise-free [ ].
Illustration of a Gaussian process (GP) model with an RBF kernel over a toy one-dimensional NPM feature space. The black points are the observed data from a toy structure-property relationship, f (x). The blue line and shaded region visualize the GP model trained on the observed data: the line is the approximationŷ (x) of the structure-property relationship, while the shaded region illustrates the uncertainty by coveringŷ (x) ± 2σ(x). The GP model shows large (small) uncertainty in regions of feature space far from (close to) the observations.

Hyperparameters of GPs.
GPs are non-parametric models, but most useful kernel functions used in GPs contain hyperparameters. For example, the RBF kernel in eqn. has the length scale γ and the signal variance σ_f as hyperparameters. To learn the hyperparameters of the kernel that give us the best approximation of f(x), we could perform a grid search over hyperparameter space for those that minimize the validation loss. A more popular and faster way to estimate the hyperparameters of the kernel is to maximize the marginal likelihood of the observed data as a function of the kernel hyperparameters. [ ] Generally, at each iteration of BO, the hyperparameters of the kernel are updated to account for the newly acquired observation. While the feature-map view of the kernel, k(x, x′) = τ(x) · τ(x′), is helpful for understanding GPs and for sampling functions from the distribution over the function space they describe, in practice we conduct GP inference using the kernel directly, through eqns. and . E.g., τ(x) is a vector of infinite dimension in the case of the RBF kernel, making the feature-map view of GPs unfriendly for computations.
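Marginal-likelihood maximization can be illustrated with a simple grid search over the RBF length scale γ, using the standard Gaussian log-evidence log p(y | X) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π. A sketch (the toy data, the γ grid, and the jitter value are illustrative assumptions):

```python
import numpy as np

def rbf(XA, XB, sigma_f=1.0, gamma=0.2):
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * gamma**2))

def log_marginal_likelihood(X, y, gamma, sigma_f=1.0):
    """log p(y | X) = -1/2 y^T K^-1 y - 1/2 log|K| - (n/2) log(2 pi)."""
    K = rbf(X, X, sigma_f, gamma) + 1e-6 * np.eye(len(X))   # jitter
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * len(y) * np.log(2.0 * np.pi))

# Grid search: keep the length scale that maximizes the evidence.
X = np.linspace(0.0, 1.0, 15).reshape(-1, 1)
y = np.sin(6.0 * X[:, 0])
gammas = np.logspace(-2, 0, 20)
best_gamma = gammas[int(np.argmax([log_marginal_likelihood(X, y, g)
                                   for g in gammas]))]
```

In practice, gradient-based maximization of the log marginal likelihood (as in standard GP libraries) replaces this grid search.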

Examples of acquisition functions
We provide three common examples of acquisition functions and explain how they use the surrogate model to trade off exploration and exploitation when selecting the NPM to evaluate in the next experiment.
Upper confidence bound (UCB). The UCB acquisition function selects the NPM that maximizes the upper confidence bound

A_UCB(x) = ŷ(x) + β σ(x),

where ŷ(x) and σ(x) are the predicted property of NPM x and its associated uncertainty, respectively, provided by the surrogate model. The parameter β explicitly trades off exploration and exploitation. If β is large, UCB tends to select NPMs with the highest uncertainty, exploring the feature space. If β is small, UCB tends to select NPMs with the highest predicted property, exploiting our current approximation of the structure-property relationship f(x). To clarify the UCB terminology, the top boundary of the shaded region that bands ŷ(x) in Fig. is the UCB for β = 2.
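A tiny numerical example of how β steers UCB between a high-mean candidate and a high-uncertainty candidate (the numbers are made up purely for illustration):

```python
import numpy as np

def ucb(mean, std, beta):
    """Upper confidence bound score: y_hat(x) + beta * sigma(x)."""
    return np.asarray(mean) + beta * np.asarray(std)

# Candidate A (index 0): high predicted property, low uncertainty.
# Candidate B (index 1): mediocre prediction, high uncertainty.
mean = np.array([1.0, 0.2])
std = np.array([0.1, 1.0])
exploit_pick = int(np.argmax(ucb(mean, std, beta=0.1)))  # small beta -> A
explore_pick = int(np.argmax(ucb(mean, std, beta=5.0)))  # large beta -> B
```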

Expected improvement (EI).
Another acquisition strategy is to select the NPM with the highest expected improvement (EI) over the property of the best NPM evaluated so far. Let the random variable I(x) = max(0, f(x) − max_i y_i) denote the improvement in the property of an NPM x over the best observed NPM thus far. The EI is then A_EI(x) = E[I(x)], which can be written in closed form:

A_EI(x) = (ŷ(x) − y*) Φ(z) + σ(x) φ(z), with z = (ŷ(x) − y*) / σ(x) and y* = max_i y_i,

with Φ and φ the cumulative distribution and probability density functions, respectively, of the standard normal distribution. The first and second terms, respectively, capture the exploitation and exploration components of EI.
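The closed-form EI expression can be implemented directly; a sketch using only the standard library's erf for Φ (the σ = 0 guard, returning the known non-negative improvement, is a common convention assumed here rather than taken from this work):

```python
import math
import numpy as np

def expected_improvement(mean, std, y_best):
    """Closed-form EI: (y_hat - y*) Phi(z) + sigma phi(z), z = (y_hat - y*)/sigma."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    safe_std = np.where(std > 0, std, 1.0)            # guard against sigma = 0
    z = (mean - y_best) / safe_std
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    phi = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    ei = (mean - y_best) * Phi + std * phi
    # With zero predictive uncertainty, EI is the known (non-negative) improvement.
    return np.where(std > 0, ei, np.maximum(mean - y_best, 0.0))
```

Note that EI is always non-negative and grows with both the predicted property and the predictive uncertainty, which is exactly the exploitation-exploration balance described above.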

Information-theoretic acquisition functions. The principle behind acquisition functions based on information theory is to select the NPM x that maximizes the mutual information between (i) the observation of its property y = f(x) and (ii) the location of the NPM x* in feature space that maximizes f(x). Viewing both f(x) and x* as random variables, the acquisition function

A(x) = MI[(x, y = f(x)); x*]

describes the mutual information between the observation (x, y = f(x)) of the property of a newly acquired NPM, x, and the location of the optimal NPM, x*, where MI[·, ·] is the mutual information between two random variables and H(·) is the entropy of a probability distribution p(·). The mutual information is the reduction in the entropy of the probability density function of the location of the optimal NPM, x*, as a result of observing the property y = f(x).

Illustrating BO acquisition and the exploration-exploitation tradeoff. Fig. a illustrates the EI acquisition function in a toy one-dimensional NPM space under a GP surrogate model with n = 5 observations. The EI acquisition function exhibits two maxima. The first (global) maximum is in a region of the feature space where the predicted property ŷ(x) is the largest. The second (local) maximum is where the uncertainty σ(x) is largest. We select the NPM for the next experiment, x_{n+1}, as the one that maximizes the EI acquisition function. Fig. a shows the acquired NPM assuming the database of NPMs X covers all points on the domain shown. To illustrate how the EI acquisition strategy balances exploration and exploitation, for comparison, we also show the acquired NPM x_{n+1} if the acquisition strategy were purely exploratory and purely exploitative. Pure exploration dictates x_{n+1} = arg max σ(x), but this NPM has a poor property. Pure exploitation dictates x_{n+1} = arg max ŷ(x), but this NPM is too close to an existing observation. EI balances the trade-off by picking an NPM with both high uncertainty and a high predicted property.

Experiments and Results
We now demonstrate Bayesian optimization by using it to efficiently search a database of covalent organic frameworks (COFs) for the COF with the highest simulated methane deliverable capacity.

. Experimental problem setup
Our goal is to efficiently search a database of COFs for the one with the largest methane deliverable capacity.

The methane deliverable capacity, y. The COF property we wish to maximize is the simulated deliverable capacity of methane [L STP CH4 / L COF] at K under a bar to . bar pressure swing. The deliverable capacity of the COF primarily determines the driving range of a vehicle on a "full" adsorbed natural gas fuel tank packed with the COF [ ].
The expensive objective function, f(x). Evaluating the objective function f to give y = f(x) involves conducting two grand-canonical Monte Carlo simulations of methane adsorption in the COF structure represented by x: one at ( bar, K) and one at ( . bar, K). The deliverable capacity y then follows from the difference in the simulated methane adsorption at the two conditions. The function f is expensive to evaluate, as the run time of the molecular simulations is on the order of hours.
Goal: data-efficient search for the optimal COF, x*. In an exhaustive search, we would conduct expensive molecular simulations to predict the methane deliverable capacity of every candidate COF in the database, i.e., collect {(x, y = f(x)) : x ∈ X }, to find the COF x* ∈ X with the highest deliverable capacity. In contrast, our goal is to find the optimal COF x* efficiently, conducting expensive molecular simulations on only a small proportion of the candidate COFs.
We hypothesize that BO will provide a simulation-efficient search for the optimal COF, x*. Because Mercado et al. [ ] have already exhaustively simulated the deliverable capacity of every COF in the database, in our demonstration each evaluation of f is a lookup of the precomputed deliverable capacity. Each data lookup, conceptually, represents conducting the two expensive molecular simulations of methane adsorption in a COF ourselves. N.b., that we look up data as opposed to conducting a simulation ourselves has no impact on the BO search efficiency when defined in terms of the number of COF evaluations needed to find the optimal COF. The exhaustive search of Mercado et al. [ ] allows us to readily evaluate the simulation-efficiency of different search strategies to find the optimal COF(s). We will compare the search efficiency of BO to random search, an evolutionary algorithm, and one-shot supervised learning.

Search strategies
We use several different strategies to search for the optimal COF x * exhibiting the highest methane deliverable capacity y in the database X .

Random search.
Random search is a naive baseline. At each iteration, we select an unevaluated COF from X uniformly at random to evaluate.
Random search does not make a data-informed selection of a COF for the next evaluation, as it ignores the past observations {(x_i, y_i = f(x_i))}. Thus, random search is expected to perform poorly.

Bayesian optimization (via GP surrogate).

In accordance with the assumption behind GPs, we normalize the deliverable capacities to have mean zero and unit variance (using the training data only). During the acquisition phase, we evaluate the acquisition function for each COF in the database and select the COF with the highest value, in contrast to optimizing the acquisition function over the continuous COF feature space.

Evolutionary search (via CMA-ES).
As an evolutionary search method, we use Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [ , ], a state-of-the-art, stochastic optimizer for rugged, non-convex, black-box objective functions. In CMA-ES, new COFs are stochastically selected from the search space by sampling from a multivariate Gaussian distribution over the feature space. The mean and covariance matrix of the distribution are updated over the search process, as COFs are acquired and evaluated in batches (generations). To update the distribution, with the aim of increasing the likelihood of acquiring and taking search steps towards high-performing COFs, (i) the mean is updated using a weighted average of the most high-performing COFs in the generation (a selection mechanism) and (ii) the covariance matrix is updated using a weighted average of the best search steps (from the mean) towards the high-performing COFs [ ].
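To convey the flavor of the selection-and-recombination step without reproducing the full CMA-ES machinery (which also adapts the covariance matrix and step size), here is a deliberately simplified Gaussian evolution strategy; all parameter values and names are illustrative, and a real application would use the cma library instead:

```python
import numpy as np

def simple_gaussian_es(f, dim, n_gen=30, pop=12, n_parents=4, sigma0=0.3, seed=0):
    """Toy Gaussian evolution strategy in the spirit of CMA-ES (a sketch only):
    sample a generation around the mean, evaluate, and shift the mean toward a
    weighted average of the best individuals. Real CMA-ES also adapts the
    covariance matrix and step size, which this sketch omits."""
    rng = np.random.default_rng(seed)
    mean = rng.random(dim)                          # random start in [0, 1]^dim
    w = np.log(n_parents + 0.5) - np.log(np.arange(1, n_parents + 1))
    w /= w.sum()                                    # recombination weights
    best_x, best_y = None, -np.inf
    for _ in range(n_gen):
        pop_x = mean + sigma0 * rng.standard_normal((pop, dim))
        ys = np.array([f(x) for x in pop_x])
        order = np.argsort(-ys)[:n_parents]         # best-to-worst (maximization)
        mean = w @ pop_x[order]                     # selection + recombination
        if ys[order[0]] > best_y:
            best_x, best_y = pop_x[order[0]], ys[order[0]]
    return best_x, best_y
```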
CMA-ES has two hyperparameters: the initial standard deviation for each feature of the COF representation and the number of new candidate COFs acquired in each generation (the population size). We initialized the CMA-ES algorithm with a randomly selected COF and set the initial standard deviation to . to cover our COF feature space [0, 1]^12. The population size, , was determined by a default heuristic in the cma library in Python.
CMA-ES operates in a continuous search space. When it selects a point in COF feature space for the next generation, that point does not exactly correspond to the feature vector of a COF in the database; to apply CMA-ES, we select the nearest (in feature space) unacquired COF in the database.
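Snapping a continuous CMA-ES proposal to the nearest unacquired COF is a simple nearest-neighbor lookup; a sketch (the function name and the Euclidean metric are assumptions for illustration):

```python
import numpy as np

def snap_to_library(x_query, X_lib, acquired_idx):
    """Map a continuous proposal to the nearest unacquired entry of the library."""
    d = np.linalg.norm(X_lib - x_query, axis=1)    # Euclidean distance per COF
    d[list(acquired_idx)] = np.inf                 # exclude already-acquired COFs
    return int(np.argmin(d))

# Toy usage on a three-entry "library".
X_lib = np.array([[0.0, 0.0], [1.0, 1.0], [0.4, 0.4]])
nearest = snap_to_library(np.array([0.45, 0.45]), X_lib, set())
nearest_excl = snap_to_library(np.array([0.45, 0.45]), X_lib, {2})
```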

One-shot supervised learning (via RF).
While one-shot supervised learning does not constitute active search like BO, it is the most popular method for circumventing an exhaustive search for the optimal NPM using the high-fidelity evaluation method. One-shot supervised learning has three stages. (1) COFs are selected and evaluated to serve as training data for the supervised learning model. (2) The trained model is used to predict the deliverable capacity of the remaining COFs. (3) We evaluate the COFs with the highest predicted deliverable capacity. Stages (1) and (3) both incur (costly) COF evaluations. We compare one-shot supervised learning and active search by comparing the deliverable capacities of their acquired sets of COFs when given the same budget of COF evaluations. Much like balancing exploration and exploitation, we elect to split the budget of evaluations for one-shot supervised learning equally among stages (1) and (3). Moreover, for stage (1), we try two training-set acquisition strategies: (i) uniform random selection of COFs and (ii) a max-min diversity selection strategy [ , ], whereby we sequentially acquire COFs for the training set, selecting the COF with the maximum minimum distance (in COF feature space) to a COF already in the training set (starting with an initial, random COF).
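The max-min diversity selection of the training set can be sketched as a greedy loop (an illustrative implementation, not the authors' code; the starting point is random, as described above):

```python
import numpy as np

def max_min_diverse(X, k, seed=0):
    """Greedy max-min selection: start from a random point, then repeatedly add
    the point whose minimum distance to the current selection is largest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d_min = np.linalg.norm(X - X[chosen[0]], axis=1)   # distance to selection
    for _ in range(k - 1):
        nxt = int(np.argmax(d_min))                    # farthest from selection
        chosen.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

sel = max_min_diverse(np.linspace(0.0, 1.0, 11).reshape(-1, 1), k=4)
```

Maintaining the running minimum-distance vector `d_min` keeps each greedy step at O(m) cost for a library of m feature vectors.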
As the supervised learning model, we use commonly-used [ - , ] random forests (RFs) ( trees, otherwise default parameters in scikit-learn) to approximate f (x) using the (differently sized) training sets.
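The three-stage one-shot pipeline can be sketched as follows (a simplified illustration with a hypothetical `one_shot_search` helper and a stand-in oracle; the paper's tree count is elided above, so we use 100 trees here as an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def one_shot_search(X, y_oracle, budget, train_idx):
    """One-shot supervised search (sketch).

    Stage 1: evaluate the training COFs with the expensive oracle.
    Stage 2: fit a random forest and predict the remaining COFs.
    Stage 3: spend the rest of the budget on the top-predicted COFs.
    Returns the indices of all acquired (evaluated) COFs.
    """
    y_train = np.array([y_oracle(i) for i in train_idx])      # stage 1
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[train_idx], y_train)
    rest = np.setdiff1d(np.arange(len(X)), train_idx)
    y_hat = rf.predict(X[rest])                               # stage 2
    n_eval = budget - len(train_idx)
    top = rest[np.argsort(y_hat)[::-1][:n_eval]]              # stage 3
    return np.concatenate([train_idx, top])
```

Splitting the budget equally between stages (1) and (3) corresponds to passing `train_idx` of size `budget // 2`.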

Results
We now execute each strategy to search for the optimal COF x * exhibiting the highest methane deliverable capacity y in the database X .

Search efficiency
The blue curves in Fig. show the search efficiency of BO. Three different search performance metrics are shown as the number of COFs evaluated, n (= the number of BO iterations = the number of simulations/"experiments"), increases. The first metric, in (a), is the maximum deliverable capacity among the acquired set of COFs, X n . The second and third metrics are, among the acquired set of COFs X n , (b) the highest deliverable-capacity rank, with the rank defined over the entire data set X, and (c) the fraction of the COFs in X with the highest deliverable capacity. % of the BO searches acquired the optimal COF x * after n = 120 COF evaluations; all BO searches acquired the optimal COF after n = 174 COF evaluations. After n = 250 COF evaluations, BO acquires % of the top COFs in the data set. This illustrates how BO can provide a simulation-efficient search for the optimal COF as opposed to conducting an exhaustive search; BO picked out the top COF in the database of ∼ 70000 COFs while evaluating less than ! BO also outperforms random search, evolutionary search, and the one-shot supervised learning approach. With the same number of COF evaluations (n = 250), a random search (on average) acquires only the th ranked COF and . % of the top COFs. The performance of random search is poor because it ignores past observations when it selects the next COF to evaluate. Evolutionary search and one-shot supervised learning (diverse training set) show much better search performance than random search, acquiring the th/ th ranked COF and %/ % of the top COFs on a budget of evaluations. However, neither evolutionary search nor the one-shot supervised learning strategies recovered the optimal COF x * after evaluations.
Thus, BO outperforms the baseline search methods of evolutionary search and one-shot supervised learning using both metrics of (a) the highest deliverable capacity in the acquired set and (b) the fraction of the top COFs in the acquired set given a budget of evaluations. N.b., BO is designed to optimize the former performance metric, but BO could be tailored to optimize a top-k metric [ , ].
Regarding the random versus diverse training set acquisition strategies for the one-shot supervised learning approach: Except when the training set is very small ( ), the diverse selection of training data gives better search performance than the random selection of training data since it provides better coverage of the feature space.

Visualizing the BO acquisition set
To understand the behavior of BO in searching the database of COFs for the optimal COF x * , we visualize the acquisition set of COFs in feature space as BO progresses. Given that the feature space is 12-dimensional, we resort to principal component analysis (PCA) to [approximately] reduce the dimension of the feature space to two. I.e., we project each COF feature vector onto a 2D reduced feature space through PCA of the data matrix [x 1 x 2 · · · x |X | ]. First, we visualize the structure-property relationship f (x).
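The projection onto the two leading principal components can be sketched as below (random data stands in for the COF feature matrix, whose rows are the 12-dimensional feature vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (|X| x 12) COF feature matrix from the paper.
rng = np.random.default_rng(0)
X_feat = rng.random((500, 12))

# Fit PCA on the full data set and keep the two leading components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_feat)  # each row: 2D coordinates of one COF
```

The acquired COFs at each BO iteration can then be highlighted in this 2D space, e.g. as a scatter plot colored by deliverable capacity.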

Balancing exploration and exploitation
We conceptually illustrated how the expected improvement acquisition function balances exploration and exploitation in Fig. . The expected improvement (EI) acquisition function in eqn. , used in Fig. , scores each candidate COF by the expected amount by which its deliverable capacity will exceed the highest value observed so far, under the posterior predictive distribution of the surrogate model.
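Under a Gaussian posterior with mean mu and standard deviation sigma at a candidate, EI has the standard closed form; a minimal sketch (assuming a maximization problem, as here):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for maximization under a Gaussian surrogate.

    mu, sigma : posterior mean and std of f at the candidate COF
    y_best    : best deliverable capacity observed so far
    """
    sigma = np.maximum(sigma, 1e-12)  # guard against zero variance
    z = (mu - y_best) / sigma
    # first term rewards high predicted mean (exploitation),
    # second term rewards high uncertainty (exploration)
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

Note that EI is always nonnegative, and that a candidate with a mediocre mean but large uncertainty can outscore a candidate with a slightly higher mean and no uncertainty, which is exactly the exploration-exploitation trade-off described above.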

Conclusions from experiments
BO is an active search method to find the optimal NPM in a data set whilst evaluating, with some expensive method such as a molecular simulation, only a small fraction of the NPMs in the data set. BO achieves this by making acquisition decisions that take into account all past observations and balance exploration and exploitation. The adoption of BO could dramatically impact high-throughput computational screenings of NPMs by reducing the computing cost of finding the optimal NPM, allowing us to screen larger databases of NPMs, and enabling the use of higher-fidelity but more expensive molecular models and simulation methods. Notably, BO applies to NPM search in the experimental domain as well.

Outlook
We explained the key ideas behind Bayesian optimization (BO) and advocated for its use to efficiently search databases of NPMs for the one with the optimal property, whilst synthesizing and evaluating the fewest NPMs. The ideas of BO, to sequentially and actively make intelligent decisions about which NPM to synthesize and evaluate next based on past experiments, apply to both laboratory (driven by humans or robots [ -]) and computational settings. The two core ingredients of BO are (1) a surrogate model that approximates the structure-property relationship and describes our uncertainty in it and (2) an acquisition function that scores the utility of evaluating each NPM next, designed to balance exploration and exploitation. We demonstrated BO of NPMs by using it to search a database of ca. 70000 COFs for the COF with the highest methane deliverable capacity; all BO searches acquired the optimal COF after evaluating only 174 COFs. While preparing our article, Donval and Hand et al. [ ] also demonstrated BO of MOFs and COFs for the acceleration of virtual screenings.
There are several extensions to and modifications of Bayesian optimization that are useful for different problem settings in NPM discovery: • Batch BO. In standard BO, we select a single NPM to evaluate in each iteration. However, we may have parallel experimental resources to leverage to further accelerate the search for the optimal NPM. In batch BO [ , -], we can select multiple NPMs to synthesize and evaluate in parallel during each BO iteration.
• Multi-fidelity BO. Often, we have a choice of multiple experimental methods to evaluate the property of an NPM. These methods usually involve a tradeoff in resource cost and the accuracy of the evaluation. For example, a molecular simulation of gas adsorption in an NPM is a low-fidelity experiment (cheap, but inaccurate) while measurement of gas adsorption in an NPM in a physical laboratory is a high-fidelity experiment (costly, but accurate). Intuitively, it is possible to leverage low-fidelity experiments to prune NPMs with low property values and to identify promising NPM candidates that can be searched further using high-fidelity experiments. In multi-fidelity BO [ -], we select both an NPM to evaluate and the fidelity of the experiment in each iteration. This allows optimization of the overall resource cost of experiments-of both low-and high-fidelity-for identifying high-performing NPMs.
• Multi-objective BO. We often need to optimize NPMs for multiple property objectives which conflict and cannot be optimized simultaneously. For example, for gas separations, we often wish an NPM to have both a high selectivity and a high working capacity for the gas we wish to capture [ ]. For multi-objective optimization problems, we instead seek the Pareto-optimal set of solutions. A solution is Pareto optimal if it cannot be improved in any of the objectives without compromising some other objective. The goal of multi-objective BO is to find the Pareto-optimal set of NPMs using the fewest NPM evaluations [ -]. Similarly, the -PAL algorithm [ ] has recently been used to find Pareto-optimal polymers.
• Constrained BO. Possibly, some NPMs in the search space cannot be synthesized. More, often we cannot know if an NPM is synthesizable until we attempt its synthesis, which still incurs a cost. In this context, synthesizability is a black-box constraint over the search space.
In constrained BO [ -], we perform BO where the synthesizability of an NPM cannot be verified without performing an experiment. The typical approach involves learning a statistical model based on the past evaluations of constraint(s) and selecting high-utility NPMs from the predicted feasible region (minimal to no constraint violation).
• Cost-aware BO. The evaluation cost can vary from one NPM to another (e.g., cost of synthesizing NPM). We would like to take this cost into account to reduce the overall costs incurred during the search for the optimal NPM. In cost-aware BO [ , ], the acquisition strategy considers not only the information gain of acquiring an NPM but also the cost incurred to synthesize it and measure its property.
• Robust solutions to BO. We may be uncertain about the measured/computed features of the NPMs and seek an optimal NPM that is robust to variations in its features. In robust BO, we account for the uncertainty in the inputs x when optimizing f (x) and seek flat as opposed to sharp optima [ , ].

In addition to efficiently searching for NPMs with optimal properties, BO is applicable to a wide variety of optimization problems in the chemical and materials sciences [ , ]. BO has been used to efficiently search for optimal reaction conditions [ , ] and for compositions of and processing conditions for materials [ , ].

The effectiveness of BO is predicated upon an accurate surrogate model of the structure-property relationship. In turn, the accuracy of the surrogate model is predicated on (i) an information-rich representation x of the NPM that encodes the salient features of its structure and chemical composition and (ii) a statistical model that (a) is sufficiently flexible/expressive to approximate the underlying objective function and (b) learns in a data-efficient manner. This gives important and currently active directions for future research. In particular, engineering useful vector representations x of NPMs, using domain knowledge, is a very active research area [ , ]. The representation should be invariant to rotations, translations, replications (if a crystal), and permutations of the list of atoms comprising the structure. Moreover, the mapping from NPM structures to feature vectors should be injective. Graph neural networks can, instead, learn vector representations of NPMs from their crystal structures represented as graphs with node labels.