Abstract
Nanoporous materials (NPMs) could be used to store, capture, and sense many different gases. Given an adsorption task, we often wish to search a library of NPMs for the one with the optimal adsorption property. The high cost of NPM synthesis and gas adsorption measurements, whether the experiments are performed in the lab or in a simulation, often precludes exhaustive search.
We explain, demonstrate, and advocate Bayesian optimization (BO) to actively search for the optimal NPM in a library of NPMs and to find it using the fewest experiments. The two ingredients of BO are a surrogate model and an acquisition function. The surrogate model is a probabilistic model reflecting our beliefs about the NPM structure-property relationship based on observations from past experiments. The acquisition function uses the surrogate model to score each NPM according to the utility of picking it for the next experiment. It balances two competing goals: (a) exploitation of our current approximation of the structure-property relationship to pick the highest-performing NPM, and (b) exploration of blind spots in the NPM space to pick an NPM we are uncertain about, to improve our approximation of the structure-property relationship. We demonstrate BO by searching an open database of ~70,000 hypothetical covalent organic frameworks (COFs) for the COF with the highest simulated methane deliverable capacity. BO finds the optimal COF and acquires 30% of the top 100 highest-ranked COFs after evaluating only ~120 COFs. Moreover, BO searches more efficiently than evolutionary and one-shot supervised machine learning approaches.
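The loop described above (fit a surrogate, score every candidate with an acquisition function, run the next experiment, repeat) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the repository: the surrogate is assumed to be a zero-mean Gaussian process with a squared-exponential kernel and a fixed length scale, the acquisition function is expected improvement (EI), and the "library" is a toy set of random 2D feature vectors with a made-up property. The names `rbf_kernel`, `gp_posterior`, `expected_improvement`, `bo_search`, and `run_experiment` are hypothetical.

```python
import math
import numpy as np

def rbf_kernel(A, B, length_scale):
    # Squared-exponential kernel between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X_train, y_train, X_all, length_scale=0.3, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean, unit-variance GP
    # (assumed surrogate; hyperparameters are fixed rather than fitted).
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_all, length_scale)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    # EI for maximization: E[max(y - best, 0)] with y ~ N(mean, std^2).
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def bo_search(X, run_experiment, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a few randomly chosen materials.
    acquired = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    y = {i: run_experiment(i) for i in acquired}
    for _ in range(n_iter):
        y_acq = np.array([y[i] for i in acquired])
        # Standardize outputs using only the already-acquired observations.
        y_std = (y_acq - y_acq.mean()) / (y_acq.std() + 1e-12)
        mean, std = gp_posterior(X[acquired], y_std, X)
        scores = expected_improvement(mean, std, y_std.max())
        scores[acquired] = -np.inf  # never re-pick an evaluated material
        nxt = int(np.argmax(scores))
        acquired.append(nxt)
        y[nxt] = run_experiment(nxt)
    return acquired, y

# Toy library: 200 hypothetical materials with 2 features each (made-up data).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))
truth = -np.sum((X - 0.5) ** 2, axis=1)  # stand-in for the measured property
acquired, y = bo_search(X, lambda i: float(truth[i]))
print(len(acquired), max(y.values()), truth.max())
```

Replacing the EI score with the posterior mean alone gives pure exploitation, and replacing it with the posterior standard deviation alone gives pure exploration; EI sits between the two.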
Version notes
- added new section illustrating exploration/exploitation balance by using EI, max y, and max sigma as acquisition functions
- for evolutionary search, when the algorithm proposes a new point in feature space, we now select the closest COF in the database that is *not in the acquired set*. [before, the evolutionary search picked the same COF over and over, and each repeat was counted as a COF evaluation]. we also clarify this in the text now.
- normalize outputs in BO *only based on the training/already-acquired observations*
- random forest: the budget of evaluations is now split 50% explore, 50% exploit. also, the correct number of training data points is now used for the search efficiency plot, so RF now performs better than random search, consistent with intuition.
- refactored and commented code in Jupyter Notebooks
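The evolutionary-search fix noted above (map a proposed feature vector to the closest COF that has not yet been acquired) could look roughly like the following. This is a hedged sketch, not the repository's implementation: Euclidean distance in feature space is an assumption, and `closest_unacquired` and the tiny feature matrix are hypothetical.

```python
import numpy as np

def closest_unacquired(query, X, acquired):
    # Distance from the proposed feature vector to every material in the library
    # (Euclidean distance is an assumption made for this sketch).
    dists = np.linalg.norm(X - query, axis=1)
    # Rule out materials that were already evaluated.
    dists[np.fromiter(acquired, dtype=int)] = np.inf
    return int(np.argmin(dists))

# Made-up library of four materials with two features each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [5.0, 5.0]])
acquired = {1}  # material 1 has already been evaluated
idx = closest_unacquired(np.array([1.0, 0.0]), X, acquired)
print(idx)
```

Here the proposed point coincides with material 1, but because that material is already in the acquired set, the lookup falls through to its nearest unevaluated neighbor, so no evaluation is spent on a repeat.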
Supplementary weblinks
GitHub repo with the code
code to reproduce our results.