Data-driven matching of experimental crystal structures and gas adsorption isotherms of metal-organic frameworks †

Porous metal-organic frameworks are a class of materials with great promise in gas separation and gas storage applications. Due to the high-dimensional space of materials science and engineering, computational screening techniques have long been an important part of the scientific toolbox. However, a broad validation of molecular simulations in these materials is impeded by the lack of a connection between databases of gas adsorption experiments and databases of the atomic crystal structure of corresponding materials. This work aims to connect the gas adsorption isotherms of metal-organic frameworks collected in the NIST/ARPA-E Database of Novel and Emerging Adsorbent Materials to the corresponding crystal structures in the Cambridge Structural Database. With tens of thousands of isotherms and crystal structures reported to date, an automatic approach is needed to establish this link, which we describe in this paper. As a first application and consistency check, we compare the pore volume measured from low-temperature argon or nitrogen isotherms to the geometrical pore volume computed from the crystal structure. Overall, 545 argon or nitrogen isotherms could be matched to a corresponding crystal structure. We find that the pore volume computed via the two complementary methods shows acceptable agreement only in about 35 % of these cases. We provide the subset of isotherms measured on these materials as a seed for a future, more complete reference data set for computational studies.

† Official contribution of the National Institute of Standards and Technology; not subject to copyright in the United States. Certain commercially available items may be identified in this paper. This identification does not imply recommendation by NIST, nor does it imply that it is the best available for the purposes described.


Introduction
Porous materials are employed in many applications, such as gas storage, 1 separations, 2 catalysis, 3,4 and sensing. 5,6 For many years, the porous materials research and development community was dominated by activated carbons and zeolites, but over the last two decades the field has grown enormously thanks to the discovery of porous metal-organic frameworks (MOFs) 7 and covalent organic frameworks (COFs). [8][9][10] In this work, we are interested in the gas adsorption properties of MOFs: crystalline frameworks constructed from metal nodes and organic linkers. At present, crystal structures for over 10 000 different porous MOFs have been reported in the Cambridge Structural Database (CSD), 11 and many computational studies have performed molecular simulations to predict the gas adsorption properties of these materials starting from the reported crystal structure. 12 The NIST/ARPA-E Database of Novel and Emerging Adsorbent Materials (NIST-ISODB), 13 on the other hand, is the world's largest public collection of experimental gas adsorption isotherms. Its over 30 000 isotherms cover a wide range of adsorbent materials including not only MOFs, COFs, and zeolites, but also activated carbons and amorphous porous polymers, thus making it a treasure trove for data-driven analysis. 14,15 The database also holds great potential for the integration of data-driven approaches with physics-based models, provided that a link between an adsorption isotherm in the NIST-ISODB and the crystal structure of the corresponding MOF in the CSD can be established. 16 For example, in 2017, Sholl et al. investigated differences between independent measurements of CO2 isotherms on the same MOFs, providing guidelines on how to identify confidence thresholds, assigning ratings for consistency and reproducibility, and comparing the experimental data to simulated CO2 isotherms. 15 The study was limited to materials for which adsorption measurements had been independently repeated by several groups: 27 MOFs with two or more independently measured CO2 isotherms (211 isotherms in total).
For a few tens of MOFs, the link between the adsorbent in the NIST-ISODB and the crystal structure in the CSD could therefore be established manually, for example by relying on the conventional names of the MOFs or on the CSD reference codes reported in the publications.
The same group extended the analysis in 2020 with a focus on alcohols, comparing simulations of methanol and ethanol in four MOFs with the confidence interval obtained from experimental isotherms via their protocol. 17 In this work, we aim to extend this link to as many gas adsorption isotherms for MOFs as possible, irrespective of the target application: how many MOF identities can we link both to experimentally resolved crystal structures (CSD) and to experimental gas adsorption isotherms (NIST-ISODB) by relying on the metadata present in the two databases?
This raises a further question: how can we gauge whether the linked crystal structure is a reasonable model for the experimental sample on which the adsorption study is performed?
In order to address this question, we recall that the first step of an adsorption study typically involves the characterization of the adsorbent's pore volume by recording a nitrogen or argon adsorption isotherm at low temperature. 18 At the same time, the pore volume can also be directly computed from the crystal structure and, being foremost a geometric property, 19 carries less uncertainty related to force field parameters than, e.g., the simulation of a CO 2 adsorption isotherm. 1 These calculations assume a defect-free, infinite crystal, which is impossible to obtain experimentally. While some deviations from the experimental pore volume can therefore be expected, large deviations indicate that the crystal structure is not representative of the actual sample. We therefore propose using the pore volume as a basic consistency check.
In the following, we explore different routes to establishing the structure-isotherm match.
We report statistics on the matches and perform the consistency check described above, comparing the computed and measured pore volumes. We discuss the different reasons that lead to mismatches, and provide a reference set of isotherms with linked crystal structures deemed suitable for comparison with molecular simulations. The reader can inspect all the steps performed within this pipeline by browsing the Jupyter Notebooks provided in the GitHub repository https://github.com/danieleongari/matching_isodb_csd.

Overview and inspection of the NIST-ISODB and CSD
This analysis is based on the NIST-ISODB version as available from the official GitHub repository in September 2021, 20 containing 35 482 digitized isotherms for 7386 adsorbent materials and 280 different adsorbate molecules.
The NIST-ISODB was initially conceived as a list of publications on novel adsorbents with minimal metadata. Of the 4128 indexed publications, ≈ 80 % are associated with digitized isotherms, mostly obtained from measurements, but also from fitting to experimental data or from molecular simulations. The collection of metadata on the method used to obtain isotherms started only at a later stage, first at the level of the publication and then at the level of individual isotherms. We include only isotherms that are marked themselves as "experimental" or that are contained in a reference that is marked to contain exclusively experimental isotherms. After excluding isotherms coming from simulations or models and isotherms of unknown origin (unspecified, or the reference paper is denoted to contain both), we end up with 21 375 experimental isotherms (60 % of the total 35 482).
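The inclusion rule described above can be sketched as a small predicate. This is a minimal illustration only: the field names (`category`, `doi`), the label values (`"exp"`, `"sim"`, `"mixed"`), and the decision to let an isotherm's own label take precedence over the label of its reference are assumptions made for this example, not the actual NIST-ISODB schema.

```python
def is_experimental(isotherm, reference_categories):
    """Return True if an isotherm counts as experimental.

    isotherm: dict with hypothetical keys "category" and "doi".
    reference_categories: dict mapping a DOI to the (hypothetical)
    label of the publication, e.g. "exp", "sim" or "mixed".
    """
    own_label = isotherm.get("category")
    # Rule 1: the isotherm itself is marked as experimental.
    if own_label == "exp":
        return True
    # Rule 2: the isotherm has no label of its own, but its reference is
    # marked as containing exclusively experimental isotherms.
    if not own_label:
        return reference_categories.get(isotherm["doi"]) == "exp"
    # Anything else (simulation, model, unknown) is excluded.
    return False
```

Isotherms from references labelled as containing both experimental and simulated data are rejected unless they carry their own "experimental" label, which mirrors the exclusion of isotherms of unknown origin.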
The number of different adsorbate molecules is limited, and mapping their conventional names to their chemical formula is straightforward. The NIST-ISODB includes a mapping of different synonyms for the same gas, such as water and H 2 O, to the same InChIKey that uniquely identifies the gas molecule.
As for the adsorbent materials, the NIST-ISODB includes MOFs, zeolites, activated carbons, covalent organic frameworks, and other classes of materials. Adsorbents are identified by a name, which is typically taken from the figure caption of the digitized adsorption isotherm and is therefore poorly standardized. While 3.2 % and 7.0 % of the names contain the keywords "zeolite" and "carbon", respectively, the remaining adsorbents are difficult to classify in an automated way from their name. We can expect that a large portion of them are MOFs due to the large number of different MOFs reported to date; the reader can refer to the work of Cai et al. for more statistics about the manual classification of a sample of 333 adsorbents from gas adsorption data indexing including the NIST-ISODB. 21 The NIST-ISODB also includes a mapping of synonyms for adsorbents. For example, the CuBTC MOF has also been reported as Basolite C300, C300, Cu-BTC, Cu3(BTC)2, CuBTC, and MOF-199, and in the NIST-ISODB all the corresponding isotherms are assigned to the same adsorbent. We emphasize from the outset, however, that constructing such a mapping for adsorbents is more complex than for adsorbates, since the conventional names for MOFs can be ambiguous: this is a challenge that recurs throughout this work. For example, MOFs such as MOF-74 or MIL-53 can be synthesized with different metal nodes and the NIST-ISODB record does not always specify the identity of the coordination metal (e.g., "MIL-53" instead of "MIL-53(Cr)"). Furthermore, some authors use generic names like "MOF-1" intended only to enumerate materials within the particular study. We therefore excluded ambiguous adsorbent names from the NIST-ISODB mapping of synonyms in order to avoid matching materials by synonym that do not have the same crystal structure.
Interested readers can find a full list of excluded names and a detailed discussion in Section [...].
Among the 21 375 experimental isotherms collected in the NIST-ISODB, the main adsorbates are nitrogen (5153), carbon dioxide (4366), hydrogen (2540), methane (2212) and water (607). Of particular interest for the present study are the experimental isotherms for nitrogen at 77 K (a total of 4003) and for argon at 87 K (a total of 140).
We note that the NIST-ISODB can contain multiple, different isotherms for the same publication, adsorbent, adsorbate, and temperature. Publications with two or three isotherms recorded under identical conditions often stem from the digitization of multiple figures that display the same isotherm data in different pressure ranges or compared to different adsorbate or adsorbents. Publications with three or more "duplicates" often report adsorption in adsorbents synthesized under different conditions, where the isotherm serves as a benchmark of the quality of the material (e.g., Figure 7 in Ref. 22), or to illustrate the effect of some post-synthetic treatment (e.g., Figure 4 in Ref. 23). In these cases, therefore, the name of the MOF is reported as the same for the digitization, but the sample does contain differences. Other reasons include reporting multiple adsorption-desorption cycles on the same sample. The experimental reproducibility of isotherms in the NIST-ISODB for the same gas-adsorbent pair has been discussed in the literature, e.g., by Sholl and coauthors. 15 Clearly, however, isotherms that have been measured to study the effect of different synthesis conditions 24 or post-synthetic treatments can be expected to differ from each other, unlike measurements of samples that have been synthesized to reproduce the same material, or studies on the same sample by different groups. A notable case is the NIST inter-laboratory study of methane adsorption in zeolite Y, which reports 109 isotherms at the same conditions: different research groups were asked to independently measure the uptake at 298 K as a way of investigating the reproducibility of the measurement. 25 We postpone the filtering of isotherms recorded under apparently identical conditions to a later stage, when more information on these can inform a rational protocol for their selection.
Moving to analyse the CSD database of crystal structures, in the present study we em- conditions. In the absence of an automated method for selecting a representative structure, this selection eventually requires manual inspection. In order to limit the effort involved, we exclude CSD entries if there are more than three other entries of the same conventional name from the same paper. The reasoning behind keeping three structures instead of just one is to be able to evaluate the uncertainty on the computed pore volume from different measurements of the crystal structures.
The following subsections describe the matching procedure, which is summarized in Figure 1.

Matching by conventional names
The most straightforward way of matching an adsorbent from the NIST-ISODB to a MOF crystal structure in the CSD is by the conventional name of the MOF, i.e., when any of the synonyms of a NIST-ISODB adsorbent matches any synonym on the CSD side.

[Figure 1, caption fragment: First match by conventional name. The remaining three approaches attempt a match by reference to a common published article that reported both the crystal structure(s) and the adsorption isotherm(s).]

One downside of this approach lies in the low number of conventional names reported on the CSD side. In many cases, authors do not specify the conventional name used in the publication as metadata during the deposition to the CSD, meaning that a conventional name may have been used on the NIST-ISODB side that is not present in the CSD. In other cases, the publication reporting the crystal structure may not mention a conventional name either, in which case the name reported in the NIST-ISODB is typically chosen as the chemical formula of the unit cell, for example C66H50B2CoCu6N18O24 or C11H12GdO11. In these cases, it is clearly harder to identify the same material in other publications, as it will likely be labelled with a conventional name or with a formula in a different format. Moreover, two MOFs can have the same formula but different topology, giving rise to different gas adsorption properties.

The first normalization step led to 12 additional matches, one of which turned out to be a false positive upon manual inspection: in CD-MOF-1, "CD" stands for cyclodextrin, 27 while in Cd-MOF-1, "Cd" stands for cadmium. While this was the only false positive, this finding suggests that case mismatches should continue to be inspected manually in future updates.
The third step becomes necessary since the conventional name of the materials in the CSD sometimes contains the name of the adsorbed gas as a suffix. We compiled a list of common adsorbates (see Supporting Information, Section [...]).
This approach provided matches for only 334 out of the 7386 NIST-ISODB adsorbents. The most frequently reported MOFs in this set are ZIF-8, UiO-67, IRMOF-1 and CuBTC (HKUST-1), reported in 12, 6, 5 and 5 distinct publications, respectively. For example, CuBTC is associated (via this conventional name or any of its synonyms) with the CSD entries FIQCEN (deposited together with the original 1999 paper), 28 BOPAN, DIHVIB, DOTSOV42, and LUDLED. These are the most interesting cases for our study, as they allow us to compare structures and isotherms measured in different studies for the same MOF. However, only a minority of the MOFs we could match by synonym are present in different articles and thus have distinct measurements (22 in total), which motivates our search for other ways to match structures with NIST-ISODB adsorbents.
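To make the name-based matching concrete, the following minimal sketch matches two synonym tables after a simple normalization. Only the two normalization steps discussed in the text are mimicked (case-insensitive comparison and removal of an adsorbate suffix); the suffix list, the separators, the function names and the placeholder reference codes in the usage note are assumptions made for illustration and do not reproduce the actual pipeline.

```python
def normalize(name, adsorbate_suffixes=("co2", "n2", "h2o")):
    """Lower-case a conventional name and strip a trailing adsorbate suffix.

    The suffix list is a short, hypothetical stand-in for the list of
    common adsorbates mentioned in the text.
    """
    key = name.strip().lower()
    for suffix in adsorbate_suffixes:
        for separator in ("-", "_", " "):
            tail = separator + suffix
            if key.endswith(tail):
                key = key[: -len(tail)]
    return key


def match_by_name(isodb_synonyms, csd_names):
    """Return {adsorbent: set of CSD reference codes} for matching names.

    isodb_synonyms: {adsorbent: [synonyms]}
    csd_names:      {refcode: [conventional names]}
    """
    # Index every normalized NIST-ISODB synonym by its adsorbent.
    index = {}
    for adsorbent, names in isodb_synonyms.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(adsorbent)
    # Look up every normalized CSD name in that index.
    matches = {}
    for refcode, names in csd_names.items():
        for name in names:
            for adsorbent in index.get(normalize(name), ()):
                matches.setdefault(adsorbent, set()).add(refcode)
    return matches
```

Note that `normalize("CD-MOF-1")` and `normalize("Cd-MOF-1")` return the same key, which is exactly the kind of false positive described above; matches that only appear after lower-casing are therefore candidates for manual inspection.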

Matching by DOI
A complementary approach to match an adsorbent from the NIST-ISODB to a MOF crystal structure in the CSD is to use the DOI to identify entries coming from the same publication.
After converting all DOIs to lower case, we find an additional 476 matches between NIST-ISODB adsorbents and any CSD entries in the MOF-subset that were not already matched by synonym. These matches fall into three categories, depending on how many CSD entries and NIST-ISODB adsorbents are linked to the publication used for the matching.
In the first category, the publication is associated with exactly one CSD entry and one NIST-ISODB adsorbent, making it reasonable to assume a direct match. We count 319 NIST-ISODB adsorbents added to the successful matches because of this one-to-one reference.
In the second category, the reference paper is associated with one NIST-ISODB adsorbent but multiple CSD entries: 157 NIST adsorbents are mapped to 583 CSD entries.

Reasons for this include:
(SGHs) identified different desolvated structures, which were discarded for the time being, since only a manual check of the report would allow us to select a representative structure to match with the NIST-ISODB adsorbent.
In the third category, the reference paper is associated with multiple NIST-ISODB adsorbents (and, usually, multiple CSD entries): we count 330 articles falling in this category, which are associated with 937 CSD entries. For example, the DOI of Ref. 30 is associated with CSD entries DANWOF and DANWUL, and with isotherms in the NIST-ISODB for two MOFs labelled MOF-235 and MOF-236. Since these names are not reported in the CSD metadata, the association between isotherm and CSD entry is lost and could only be recovered by a careful reading of the publication, which would be a substantial effort considering the many CSD structures in this category. Therefore, we decided to discard these ambiguous matches.
To recap, by matching on both the conventional name and the DOI, we were able to identify the structure for 683 NIST-ISODB adsorbents, linked to 842 CSD entries: of these, as previously reported, about one half are matched by name and the other half by one-to-one DOI match.
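The three DOI categories can be expressed compactly in code. The sketch below is an illustration under assumed inputs (plain `(doi, identifier)` pairs and hypothetical category labels); it is not the code used in the repository.

```python
from collections import defaultdict


def classify_doi_matches(isodb_records, csd_records):
    """Classify each shared DOI into one of the three categories.

    isodb_records: iterable of (doi, adsorbent name)
    csd_records:   iterable of (doi, CSD reference code)
    Returns {lower-cased doi: "one-to-one" | "one-to-many" | "ambiguous"}.
    """
    adsorbents = defaultdict(set)
    refcodes = defaultdict(set)
    # DOIs are compared after conversion to lower case, as in the text.
    for doi, adsorbent in isodb_records:
        adsorbents[doi.lower()].add(adsorbent)
    for doi, refcode in csd_records:
        refcodes[doi.lower()].add(refcode)

    categories = {}
    for doi in adsorbents.keys() & refcodes.keys():
        if len(adsorbents[doi]) > 1:
            # Third category: several adsorbents, link cannot be resolved.
            categories[doi] = "ambiguous"
        elif len(refcodes[doi]) == 1:
            # First category: one adsorbent, one CSD entry.
            categories[doi] = "one-to-one"
        else:
            # Second category: one adsorbent, several CSD entries.
            categories[doi] = "one-to-many"
    return categories
```

Only the "one-to-one" class can be accepted without further checks; the "one-to-many" class requires a comparison of the candidate structures, and the "ambiguous" class corresponds to the matches discarded above.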

CSD structure analysis
Comparing structure graphs
The CSD contains a sizeable number of identical crystal structures reported under different reference codes. 31,32 In the following, we describe how to flag these based on comparing their atomic structure graphs.
Before performing a gas adsorption study, it is common practice to activate the adsorbent, resulting in a more porous structure as solvent molecules are removed. In the CSD, structures are often reported with solvent molecules still present inside the pores. Different research groups may synthesize the same material with different solvents or report crystal structures at different activation stages. For this reason, we computationally removed all the free and coordinated solvents using the algorithm provided with the original release of the CSD MOF subset. 33 In particular, we only removed solvent molecules included in a list of common molecules provided by the CSD API. This avoids the removal of (necessary) charge-counterbalancing ions, or even the removal of some parts of the crystal structure. 34 After removing the solvent, we compute the primitive cell of the crystal structure using pymatgen 35 and spglib 36 (with a symmetry tolerance of 0.1). We then use the VESTA 37 cutoffs for bond distances to construct a structure graph (see Figure 2), in which every atom is a node and the edges are the bonds inferred using the VESTA cutoff heuristic.

[Figure 2, caption fragment: [...] between the atoms (shown as blue nodes), thus making it robust against small changes in the atom positions or cell dimensions. Second, we use the Weisfeiler-Lehman algorithm to aggregate information about the neighborhood of each node and compute a characteristic fixed-length hash of the graph (more information in Figure S6 of the Supporting Information). Comparing two graphs for identity then reduces to comparing their structure graph hash.]
Given the structure graph, we then compute the Weisfeiler-Lehman 38 graph hash of the undirected structure graph using networkx, 39 using the atom types as node labels. The structure graph hash (SGH) of two crystal structures should be identical if and only if the bond network of the two structures is identical, allowing us to identify duplicates simply by comparing their SGH.
We also compute the hash of the undecorated graphs (ignoring atom types), which allows us to identify structures that only differ by the metal (e.g., Co-MOF-74 vs Ni-MOF-74).
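To illustrate the idea behind the structure graph hash, the following is a simplified, pure-Python Weisfeiler-Lehman-style hash. It is a sketch and not the networkx implementation used in this work: the function names, the number of refinement iterations and the choice of `blake2b` are assumptions, and the bond list is taken as given (no VESTA cutoffs, no periodic images).

```python
import hashlib


def _digest(text):
    """Short, deterministic hash of a string."""
    return hashlib.blake2b(text.encode(), digest_size=8).hexdigest()


def structure_graph_hash(elements, bonds, iterations=3, decorated=True):
    """Weisfeiler-Lehman-style hash of a bond graph.

    elements:  list of element symbols, one per atom
    bonds:     iterable of (i, j) atom index pairs
    decorated: if False, atom types are ignored (undecorated graph)
    """
    neighbors = {i: set() for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    # Initial node labels: the atom type, or a constant if undecorated.
    labels = {i: (elements[i] if decorated else "X") for i in neighbors}
    histogram = sorted(labels.values())

    for _ in range(iterations):
        # Each node absorbs the sorted labels of its neighbors.
        labels = {
            i: _digest(labels[i] + "|" + ",".join(sorted(labels[j] for j in neighbors[i])))
            for i in neighbors
        }
        histogram.extend(sorted(labels.values()))

    # The hash depends only on the multiset of labels, not on atom order.
    return _digest(";".join(histogram))
```

Two structures whose atoms are listed in a different order but share the same bond network receive the same hash, and swapping the metal changes the decorated hash while leaving the undecorated one unchanged. As with any Weisfeiler-Lehman scheme, equal hashes are a strong indication of identical bond networks rather than a strict proof of graph isomorphism.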

Consistency checks on names and crystal structures
Matching CSD structures via their SGH provides a complementary method to matching them via other metadata, such as the DOI of the publication or their conventional name.
This provides us with an opportunity to perform a consistency check.
First, we investigate those CSD entries that include a conventional name but were not on the list of synonyms of the NIST-ISODB and therefore could only be matched by DOI.
It is instructive to list the classes of reasons why no match was established:
• One name explicitly reports the metal, the other does not (e.g., DUT-49(Cu) vs DUT-49)
• The conventional name used for a material is not yet known as a synonym in the NIST-ISODB (e.g., UHM-30 vs Cu3(NH2btc)2)
• The CSD reports the conventional name, while the chemical formula is used in the NIST-ISODB (e.g., C26H8Cu2N2O12 and SNU-50)
We note that these issues could be addressed on the CSD side by adopting stricter rules for the reporting of conventional names, as well as by expanding the list of known synonyms in the NIST-ISODB.
As a second consistency check, we took all the CSD entries linked to a given NIST-ISODB adsorbent and checked whether the SGH of all structures was identical. Manual inspection revealed that in a minority of cases the same MOF name was indeed used to identify different structures in independent reports: this is usually the case for generic names chosen for enumeration purposes within the publication (e.g., MOF-1, Cd-MOF-1, PCP-1/2/3). More elaborate names can clash as well, however: for example, the name "ZJU-21" was used to identify both a Cu-based MOF in 2014 40 and a Zn-based MOF in 2016. 41 Both reports are from authors affiliated with Zhejiang University (acronym "ZJU") but do not share any co-author. In most cases, however, different SGHs point to slight differences in how the crystal structure of the "same material" is reported: crystal structures with/without disorder in the framework, disorder in the solvent, which was therefore not recognised as a known solvent molecule and not removed by the "computational activation", or the presence/absence of hydrogen atoms in the reported CIF.
In response, we removed CSD entries that cannot be uniquely matched with a NIST-ISODB adsorbent as well as entries with overlapping atoms. This operation was done manually, and the exclusions are tracked in the GitHub repository associated with this project. 42 Finally, we compared the SGH of all structures once more, but with a different purpose: to identify cases where the SGH of two CSD structures was the same but the name was different. For example, only experienced scientists in the MOF field are likely to recognize MAF-4 as a synonym for ZIF-8; 43 this analysis allowed us to add this new synonym to the NIST-ISODB adsorbent definition.
Informed by the previous visual inspection, we ran a further check on the crystal structures, since most of them do not have a second structure to compare against for verification. We further removed CSD structures with atomic overlaps (38) and lone molecules (80, possibly disordered or unrecognized solvent). We also checked for the presence of hydrogen atoms (in certain cases not explicitly included in the crystal structures deposited at the CSD), but given the low impact of hydrogen on the calculation of the internal volume and the weight of the crystal, we did not exclude these 47 defective structure files.
After the data cleaning of this section, we are left with 569 NIST-ISODB adsorbents, matched with 666 CSD entries.
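The two SGH-based checks of this section (same name with different hashes, and same hash with different names) can be sketched as follows. The dictionary layouts and function names are hypothetical and chosen for illustration; the flagged cases still require the manual inspection described above.

```python
from collections import defaultdict


def inconsistent_adsorbents(links, sgh):
    """Adsorbents whose linked CSD entries do not share a single hash.

    links: {adsorbent: [CSD reference codes]}
    sgh:   {refcode: structure graph hash}
    """
    flagged = {}
    for adsorbent, refcodes in links.items():
        hashes = {sgh[refcode] for refcode in refcodes}
        if len(hashes) > 1:
            flagged[adsorbent] = sorted(hashes)
    return flagged


def synonym_candidates(names, sgh):
    """Groups of different conventional names that share the same hash.

    names: {refcode: conventional name}
    """
    by_hash = defaultdict(set)
    for refcode, name in names.items():
        by_hash[sgh[refcode]].add(name)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

The first function surfaces name clashes such as the ZJU-21 case, while the second surfaces candidate synonyms such as MAF-4 and ZIF-8.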

Pore volume comparison
A key motivation of this study is that the connection between an isotherm and a crystal structure enables the comparison to predictions from molecular simulations. Molecular simulations typically represent the adsorbent as an infinite perfect crystal, while experimental samples may include defects, amorphous regions or regions where the sample is only partly activated. Comparing the experimental to the theoretical pore volume thus provides a first consistency check for the periodic crystal representation.
Experimentally, the pore volume is routinely determined from the adsorption isotherm for nitrogen at 77 K or argon at 87 K using the Gurvich Rule. 18 Therefore, a large number of these low-temperature characterization isotherms are available in the NIST-ISODB: for 291 of the 569 matched adsorbents at least one characterization isotherm is reported. This ratio is slightly higher than in the NIST-ISODB overall 2 but it still forces us to exclude a substantial number of MOFs, and for the future of the NIST-ISODB we suggest placing additional focus on providing these key characterization data in digital form.
Rather than relying on the experimental pore volume reported by the authors, we can now use the characterization isotherms in order to consistently apply the same methodology for computing the pore volume across all structures. Figure 3 shows the application of our methodology.
We extract the experimental pore volume from the average uptake (of nitrogen at 77 K or argon at 87 K) in the 0.6 bar to 0.8 bar range. Only a few characterization isotherms (29 out of 860) were excluded because no pressure points fell within this range. The Gurvich rule states that the density of the saturated nitrogen (or argon) in the pores is equal to its liquid density ($\rho_\mathrm{liq}^{\mathrm{N_2}}$; equal to 28.83 mol L⁻¹ and 34.98 mol L⁻¹ for nitrogen and argon, respectively), regardless of the shape of the internal void network or the chemistry of the crystal structure. 44 3

Figure 3: Comparison of measured and computed pore volume for the NIST-ISODB adsorbent CuBTC from four different isotherms. The average nitrogen uptake in the 0.6 bar to 0.8 bar range (red markers) is used to measure the experimental gravimetric pore volume (red line), and it is compared to the geometric pore volume computed from the crystal structures (green line): all the CuBTC structures we previously found lead to a similar pore volume of 0.81-0.83 cm³ g⁻¹. Note how relating the two calculations of the pore volume immediately gives an idea of the quality of the sample for which the isotherm was measured.

3 [...] channels, for example in the case of commensurate adsorption. 45

Under this assumption, the pore volume ($v_\mathrm{pore}$) of the adsorbent is computed as:

$$ v_\mathrm{pore} = \frac{n_\mathrm{ads,sat}^{\mathrm{N_2}}}{\rho_\mathrm{liq}^{\mathrm{N_2}}} $$

where the adsorbate uptake $n_\mathrm{ads,sat}^{\mathrm{N_2}}$ is converted to units coherent with $\rho_\mathrm{liq}^{\mathrm{N_2}}$. A large majority of characterization isotherms report the adsorbate uptake in cm³(STP)/g, thus yielding the gravimetric pore volume. In cases where the isotherm is reported in volumetric units for the adsorbent instead, the adsorbent density (which is not recorded in the NIST-ISODB) would be needed to obtain the gravimetric pore volume. While the density could be computed from the linked crystal structure, we decided to exclude these cases (less than 1 % of the characterization isotherms finally selected) in order to avoid additional methodological uncertainty, as the authors may have used a different value for the density of the material than the one we would compute.
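As a worked illustration of the Gurvich rule, the sketch below converts an isotherm given in cm³(STP)/g into a gravimetric pore volume. The liquid densities and the 0.6 bar to 0.8 bar window are taken from the text; the molar volume of 22 414 cm³(STP)/mol (ideal gas at 273.15 K and 1 atm) and the function name are assumptions of this example, since the STP convention of an individual publication may differ.

```python
# Liquid densities in mol/L, as quoted in the text.
LIQUID_DENSITY_MOL_PER_L = {"N2": 28.83, "Ar": 34.98}

# Assumption: ideal-gas molar volume at 273.15 K and 1 atm, in cm3(STP)/mol.
MOLAR_VOLUME_STP_CM3_PER_MOL = 22414.0


def gurvich_pore_volume(pressures_bar, uptakes_cm3stp_per_g, gas="N2",
                        p_min=0.6, p_max=0.8):
    """Gravimetric pore volume (cm3/g) from the mean uptake in a pressure window.

    Returns None if no pressure point falls inside [p_min, p_max].
    """
    selected = [q for p, q in zip(pressures_bar, uptakes_cm3stp_per_g)
                if p_min <= p <= p_max]
    if not selected:
        return None
    mean_uptake = sum(selected) / len(selected)
    # cm3(STP)/g -> mol/g
    n_ads_sat = mean_uptake / MOLAR_VOLUME_STP_CM3_PER_MOL
    # mol/g divided by mol/L gives L/g; multiply by 1000 for cm3/g.
    return n_ads_sat / LIQUID_DENSITY_MOL_PER_L[gas] * 1000.0
```

For example, a nitrogen plateau of 500 cm³(STP)/g corresponds to a pore volume of roughly 0.77 cm³ g⁻¹ under these assumptions. Isotherms without any point in the pressure window return `None`, mirroring the exclusion described above.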
On the other hand, the pore volume can be computed from the crystal structure. 19 Here, we choose the geometrical pore volume, which is an upper limit to the probe accessible pore volume. It is intuitively defined as all volume inside the unit cell that is not occupied by the atoms of the framework (described as hard spheres with Bondi's van der Waals radii). 46 We note that certain pore pockets in adsorbents can be inaccessible to the adsorbing molecule due to narrow connection channels, which is not reflected in the geometric pore volume.
However, which pores are inaccessible computationally can be highly sensitive to parameters such as the kinetic radius of the molecule, its diffusion kinetics, the atomic radii used to model the framework, and the assumption of a rigid crystal (i.e., no "saloon-door" effect 47 ).
In the Supporting Information we compare different definitions of the pore volume, and conclude that their choice has negligible impact on the final statistics obtained in this work, leading us to prefer the geometric pore volume definition.
We compute the geometric pore volume using Zeo++ 48 from the experimental structures retrieved from the CSD database after computational desolvation via the CSD Python API. 49 If more than one crystal structure matches a given NIST-ISODB adsorbent, we select the crystal structure with the largest geometric pore volume as a reference. Supplementary Figure SI-1 reports the geometric pore volumes for the structures of those adsorbents associated with more than one CSD entry. The difference in pore volume is often small, i.e., below 10 % in more than 80 % of the cases.
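For intuition about the quantity being computed, the geometric pore volume can be estimated by Monte Carlo sampling of the unit cell with hard-sphere atoms. The sketch below is a toy version and not the Zeo++ algorithm: it assumes an orthorhombic cell, Cartesian coordinates, radii supplied by the caller (e.g., van der Waals radii) that are smaller than half of the shortest cell edge, and hypothetical function names.

```python
import random

AVOGADRO = 6.02214076e23  # 1/mol


def _inside(point, atom, cell):
    """True if the point lies within the atom sphere (minimum-image convention)."""
    squared_distance = 0.0
    for p, q, length in zip(point, atom[:3], cell):
        d = abs(p - q) % length
        d = min(d, length - d)  # nearest periodic image along this axis
        squared_distance += d * d
    return squared_distance < atom[3] ** 2


def geometric_void_fraction(cell, atoms, n_samples=20000, seed=0):
    """Fraction of an orthorhombic cell not covered by any atom sphere.

    cell:  (a, b, c) edge lengths in angstrom
    atoms: list of (x, y, z, radius) in angstrom
    """
    rng = random.Random(seed)
    empty = 0
    for _ in range(n_samples):
        point = [rng.uniform(0.0, length) for length in cell]
        if not any(_inside(point, atom, cell) for atom in atoms):
            empty += 1
    return empty / n_samples


def gravimetric_pore_volume(void_fraction, cell_volume_A3, cell_mass_g_per_mol):
    """Convert a void fraction into a pore volume in cm3/g."""
    # 1 cubic angstrom = 1e-24 cm3; the cell mass is per mole of unit cells.
    return void_fraction * cell_volume_A3 * 1e-24 * AVOGADRO / cell_mass_g_per_mol
```

Because every point outside the atomic spheres counts as void, this estimate includes pockets that a probe molecule could not reach, which is why the geometric pore volume is an upper limit to the probe-accessible pore volume.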
When multiple characterization isotherms for a given adsorbent are available from the same paper, it is tempting to attribute these differences to experimental uncertainty or inaccuracies of the digitization. However, inspection of some of these articles reveals that in most cases such isotherms are used for the characterization of different synthesis or activation attempts, and the isotherm with the maximum pore volume typically corresponds to the optimal procedure. We acknowledge that the maximum pore volume does not unequivocally imply optimal crystallinity (for example, the presence of defects can also lead to higher pore volumes 50 ); however, recognizing such cases goes beyond the scope of our automated comparison. For the sake of consistency, in cases of multiple isotherms, we therefore selected the one giving the maximum pore volume and discarded all others from the same article.
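The selection rule for multiple isotherms from the same article amounts to a grouped maximum. A minimal sketch, assuming records are available as plain tuples with hypothetical identifiers:

```python
def select_reference_isotherms(records):
    """Keep, per (adsorbent, article), the isotherm with the largest pore volume.

    records: iterable of (adsorbent, doi, isotherm_id, pore_volume_cm3_per_g)
    Returns {(adsorbent, lower-cased doi): (isotherm_id, pore_volume)}.
    """
    best = {}
    for adsorbent, doi, isotherm_id, pore_volume in records:
        key = (adsorbent, doi.lower())
        # Replace the stored isotherm only if this one has a larger volume.
        if key not in best or pore_volume > best[key][1]:
            best[key] = (isotherm_id, pore_volume)
    return best
```

Isotherms of the same adsorbent from different articles are kept separately, so that independent measurements remain available for comparison.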

Results and discussion
Having established the link between adsorbents, crystal structures and experimental isotherms for 569 MOFs, we can analyse how the measurement and calculations compare for the same material. Figure 4 compares the measured pore volumes as calculated from nitrogen and argon isotherms to the geometric pore volume computed from the crystal structures. Figure 5 shows the same data in the form of a histogram of the ratio between measured and geometric pore volume. As the geometric pore volume will overestimate the measured value, 19 and considering some uncertainty, one would expect those materials that are fully activated and nearly fully crystalline to fall in the 0.75-1.1 range for this measured/computed ratio. It is encouraging to see that we observe a peak in this range, accounting for ≈35 % of the measured pore volumes. However, the majority of materials falls outside this range.
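The statistic quoted above, i.e., the share of measurements whose measured/computed ratio falls in the 0.75-1.1 window, can be computed as in the following sketch. The input format and the choice of inclusive bounds are assumptions of this example.

```python
def acceptable_fraction(pairs, low=0.75, high=1.1):
    """Fraction of (measured, geometric) pore volume pairs with low <= ratio <= high.

    pairs: iterable of (measured_cm3_per_g, geometric_cm3_per_g).
    Pairs with a non-positive geometric pore volume are skipped.
    """
    ratios = [measured / geometric for measured, geometric in pairs if geometric > 0]
    if not ratios:
        return float("nan")
    within = sum(1 for ratio in ratios if low <= ratio <= high)
    return within / len(ratios)
```

With four hypothetical pairs `[(0.8, 1.0), (0.05, 0.4), (0.5, 1.0), (1.3, 1.0)]`, only the first ratio (0.8) lies in the window, giving a fraction of 0.25.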
It is interesting to investigate these materials in more detail, dividing them into three rough categories.
The first category comprises ratios close to zero (i.e., ratio < 0.1), which account for ≈10 % of the samples. These measurements report negligible or no uptake of nitrogen or argon. Figure 5 shows that most of the materials close to the x-axis (near-zero measured pore volume) also have a below-average geometric pore volume in the range of 0.25-0.5 cm³ g⁻¹. With such a small pore volume computed from the crystal structure, one could suspect small interstices where the nitrogen (or argon) may not fit or permeate.

[Figure axis label: Number of measured pore volumes]

Figure 4: Comparison of geometric pore volume computed from the crystal structure to measured pore volume obtained from nitrogen isotherms at 77 K or argon isotherms at 87 K. If more than one crystal structure is available for the same material, the one with the largest geometric pore volume is used as a reference. The color scale indicates the number of papers containing characterization isotherms for a given material, from one (dark blue) to 8 and more (yellow). The most reported MOFs, with vertically aligned yellow markers, are MIL-53, UiO-66, MOF-74, ZIF-8, CuBTC, MIL-100 and IRMOF-1 in order of increasing computed volume (see Table 1 for more details). Grey lines indicate a ratio of measured to computed pore volume of 100 %, 75 %, 50 %, and 25 %.
To investigate this further, in Figure 6 we zoom into the structures with a measured pore volume below 0.1 cm³ g⁻¹ and compare them to the pore-limiting diameter computed for the structure.
The pore-limiting diameter is the diameter of the largest molecule that can diffuse through the structure (see Figure 6). For structures with a pore-limiting diameter below the kinetic diameter of nitrogen or argon, zero uptake is expected, and these MOFs can reasonably be considered nonporous. When the pore-limiting diameter is close to the size of the adsorbate, diffusion kinetics can still be very slow, leading to negligible uptake at low temperature. However, we notice a significant fraction of structures with a pore-limiting diameter much larger than the kinetic diameter of nitrogen. In these cases, the measurement may have been conducted on a material that collapsed after solvent removal, solvent removal may have been unsuccessful, or the presence of floating counter-ions may not have been reported in the crystal structure.
Another possible explanation is that the synthesis of the porous material was unsuccessful, and the authors reported the nitrogen isotherm for the nonporous sample to document a failed attempt.
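The reasoning about the pore-limiting diameter can be summarized as a simple screening rule. The sketch below is illustrative only: the kinetic diameters are commonly quoted literature values that we assume here (3.64 Å for nitrogen, 3.40 Å for argon), and the `margin` that defines "close to the size of the adsorbate" is a hypothetical parameter, not a threshold used in this work.

```python
# Commonly quoted kinetic diameters in angstrom (assumed, not from this work).
KINETIC_DIAMETER = {"N2": 3.64, "Ar": 3.40}


def expected_uptake(pore_limiting_diameter, probe="N2", margin=0.5):
    """Rough expectation for low-temperature uptake of a probe gas.

    pore_limiting_diameter: largest-diffusing-sphere diameter in angstrom.
    margin: hypothetical window (angstrom) above the kinetic diameter in
            which diffusion may be too slow for measurable uptake.
    """
    d_kin = KINETIC_DIAMETER[probe]
    if pore_limiting_diameter < d_kin:
        return "nonporous to probe"
    if pore_limiting_diameter < d_kin + margin:
        return "possibly diffusion-limited"
    return "uptake expected"


# A structure with a wide channel but near-zero measured uptake would fall
# in the last class and is a candidate for collapse, residual solvent,
# unreported counter-ions, or a failed synthesis.
print(expected_uptake(7.0, "N2"))
```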
The second category includes the ratios between 0.1 and 0.75, for which the measured pore volume is substantially lower than the geometric pore volume. Some of the hypotheses mentioned above still apply: partial desolvation, presence of counterions, presence of unreacted ligands trapped in the pores, or partial collapse of (flexible) structures upon activation. We note that the presence of a guest molecule or a partial collapse of the structure not only reduces the space available to the nitrogen/argon probe molecule but also increases the apparent density of the material: it affects both the numerator and the denominator in the measurement of the pore volume.
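The combined effect on numerator and denominator can be illustrated with a back-of-the-envelope estimate. This is a sketch under stated assumptions, not the model used in this work: we assume that the residual guest occupies a volume given by its mass divided by a bulk-liquid-like density, and that the framework itself is unchanged. The function name, parameters, and example values are hypothetical.

```python
def apparent_pore_volume(v_geo, guest_loading, guest_density):
    """Pore volume per gram of sample when guests remain in the pores.

    v_geo:         geometric pore volume, cm^3 per gram of empty framework
    guest_loading: grams of residual guest per gram of empty framework
    guest_density: assumed density of the confined guest, g per cm^3
    """
    if guest_density <= 0 or guest_loading < 0:
        raise ValueError("density must be positive and loading non-negative")
    # Numerator: the guest takes up part of the void volume.
    free_volume = max(v_geo - guest_loading / guest_density, 0.0)
    # Denominator: the weighed sample includes the guest mass.
    sample_mass = 1.0 + guest_loading
    return free_volume / sample_mass


# Illustrative numbers: 0.80 cm^3/g framework, 0.20 g/g of a guest with
# density 1.0 g/cm^3. Volume loss alone would give 0.60 cm^3/g; including
# the added mass lowers the apparent value to about 0.50 cm^3/g.
print(round(apparent_pore_volume(0.80, 0.20, 1.0), 3))
```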
The third category includes the pore volume measurements that exceed the geometric value, ≈11.7 % of the total. Possible hypotheses include an error in the structure-isotherm match or in the crystal structure itself, a significant presence of defects in the crystal (e.g., missing ligands), or an unreliable measurement due to significant uptake on the surface of the crystal (e.g., small and packed crystals or jagged surfaces that create mesoporous interstices for probe molecules to adsorb outside the bulk of the crystal). MOFs with very strong bonds between nodes and ligands are known to display higher percentages of missing ligands, typical examples being UiO-66 and MOFs constructed from the Zr₆O₈ secondary building unit. 51
We emphasize that the possible explanations of the observed deviations listed above are hypotheses based on our experience and the inspection of individual cases in this work. In particular, this study has allowed us to identify materials for which characterization isotherms are reported by several independent studies and to compare them. The eight materials with the highest number of characterization isotherms (yellow markers in Figure 3) are listed in Table 1. Histograms of the pore volume for each individual material are shown in Figure 7, and the full set of nitrogen isotherms is plotted in Figure 8.
Table 1: MOFs with the highest number of characterization isotherms available. If more than one crystal structure was available, the one with the maximum pore volume was selected. The measured pore volume was averaged over all available nitrogen and argon isotherms.
therefore measured pore volumes), with an apparent multi-modal distribution.
In this context, the geometrical pore volume from the crystal structure is an absolute reference that helps pinpoint which reports involve highly crystalline and fully activated materials, an insight that would be difficult to gain from relative statistical analysis alone (e.g., using the method of Sholl and co-workers). 15
For CuBTC, the material with the most characterization isotherms (78 in total), we manually inspected each isotherm and the corresponding paper. No significant error related to the digitization process or the extraction of the pore volume from the isotherm was found (and the pore volume reported by the authors, when present, was similar to the one computed by us). We also note that while many of the reports contained CuBTC modifications or composite materials, the isotherms flagged as related to CuBTC in the NIST-ISODB indeed referred to the pristine version of the MOF, since it is often reported as a benchmark before further modification. However, this evidence suggests that algorithms that try to parse these values from the manuscripts via natural language processing may be particularly prone to errors, requiring elaborate tuning or supervision by an expert reader. [52][53][54] When the characterization isotherms indicated a weakly porous CuBTC (< 0.4 cm³ g⁻¹), this fact was usually mentioned by the authors (e.g., in the case of pellets, 55 or alternative synthesis routes 56 ). Among the isotherms from which we computed a low pore volume in the range of 0.4 cm³ g⁻¹ to 0.5 cm³ g⁻¹, some authors attribute the low porosity to partial activation. 57 In many other cases, however, authors did not recognize the pore volume of their sample as low, despite it being less than half of the theoretical pore volume of the perfect, solvent-free crystal (and less than half of some of the highest reported experimental values).
We can only speculate that they may have been influenced by the numerous other reports of pore volumes in this range and did not consult an independent benchmark. Going forward, we suggest treating adsorption analyses and conclusions drawn from work on low-porosity CuBTC samples with caution.
Since incomplete activation was the most-cited reason for low pore volumes, we modelled the expected pore volume for CuBTC in the presence of several solvents. As shown in Table 2, partial activation can certainly explain the large reductions observed in CuBTC (but other hypotheses are also plausible).
in an experimental pore volume that dramatically exceeded the geometric one (11.34 vs. 0.82 cm³ g⁻¹). 58 In another case, the experimental value of 0.978 cm³ g⁻¹ slightly exceeds the geometric pore volume of CuBTC, and the manuscript also reports a very high BET surface area of 2327 m² g⁻¹, the largest ever reported to our knowledge. 59 While we are not able to determine in retrospect what led to this large value (for example, the calibration of the instrument, a material with large defects, or an imprecise BET calculation 60 ), our analysis points to the potential benefit of checking the measured pore volume against the geometric pore volume of the sample before moving on to measure the adsorption of other gases.
Finally, we briefly comment on the pore volume distributions of the other seven MOFs for which independent isotherms were reported:
• MIL-53(Al) is known for swelling upon adsorption, thus opening its pores. The crystal structure we matched in this study is a closed-pore model with 0.28 cm³ g⁻¹ of geometrical pore volume (refcode: SABWAU01). 62 Using an open-pore model instead, e.g., DOYBEA, 63 we obtain a geometric pore volume of 0.55 cm³ g⁻¹, about twice that of the closed-pore configuration. Most of the measured pore volumes fall within the range between the closed-pore and open-pore models.
• UiO-66 shows a distribution of measured pore volumes around the geometric pore volume, while one would expect the geometric pore volume to be the upper bound. Indeed, the article that reports the highest measured pore volume 64 (0.74 cm³ g⁻¹ as computed by our protocol and 0.8 cm³ g⁻¹ as reported in the article) mentions that 2.3/12 of the BDC ligands were found to be missing, much higher than the normally observed ratio (1/12). As we mentioned before, defects are the most likely reason for a measured pore volume exceeding the geometric pore volume of the perfect crystal.
• ZIF-8 also has a distribution of measured pore volumes in the 50 % to 100 % range with respect to the geometric pore volume. One can spot in Figure 8 some outliers that should likely be double-checked on the experimental side.
• IRMOF-1 is the second-most reproduced sample after CuBTC. It is surprising to observe the wide spread of measured values; only the comparison with the crystal structure allows one to evaluate the quality of the sample and its desolvation. In a 2015 report, Sarkisov showed how computationally generated defects in the IRMOF-1 crystal structure impact gas adsorption, explaining deviations between experimental isotherms and those computed from the perfect crystal. 68

Conclusions and Recommendations
The NIST-ISODB has been collected by researchers and summer students, painstakingly digitizing thousands of adsorption isotherms from figures of academic papers. For those willing to contribute to this digitization effort, a tool has been developed as part of this work that streamlines the digitization process and makes it easy to submit new isotherms to the NIST-ISODB. 69 Going forward, however, we hope that the need for this digitization procedure will gradually recede as standard practices for reporting adsorption isotherm data are established. For instance, the Allotrope format, 70 AnIML, 71 Unified Data Model, 72 or the JCAMP-DX standard [73][74][75][76] provide not only a standardized serialization format but also standardized vocabularies for many techniques (however, at the moment, not for gas adsorption isotherms).
Some of these formats (e.g., the Allotrope format) even support contextualizing the data by referencing ontologies, which can enable powerful semantic search.
The output files generated by adsorption instruments vary from manufacturer to manufacturer, contain different amounts and types of metadata, and are generally not published even in their native forms. 77 This characteristic of adsorption data is a large obstacle to more efficient and more accurate entry of adsorption isotherms into repositories such as the NIST-ISODB.
Evans et al. have, however, demonstrated how to convert the output files of three manufacturers' instruments plus the NIST-ISODB JSON format into a common format, the "adsorption information file" (AIF), which allows for machine-facilitated comparison of isotherms without extensive human intervention. 77,78 The AIF format is not intended to include all possible metadata regarding an isotherm measurement, but enough to allow comparison of (ostensibly) equilibrium isotherms. The AIF format has been approved as an IUPAC Project, 79 which will facilitate its development both for other manufacturers' instruments and for generic isotherm data (which could include output of molecular simulations), as well as leverage and revise other IUPAC resources such as the IUPAC Gold Book. 80 The development version of the AIF is available for use even prior to completion of the IUPAC project, and we strongly encourage authors to release their isotherm data in the AIF development format in the supplementary information of papers without delay.
The adoption of a standard like the AIF will likely increase both the availability and the quality of adsorption data. On its own, however, it will not necessarily address the general challenge encountered in this work: establishing links between related but independently maintained data sources, such as the CSD and the NIST-ISODB. One can envision that similar arguments hold for matching the structure with an IR, NMR, or XPS spectrum, the oxidation state, 81 or the color of the crystal. 82 Importantly, the need to match entries in different databases could be avoided by providing metadata using uniform resource identifiers (URIs), such as the ORCID for the author of an entry or the link to a structure in the CSD; when the MOF is not reported in the CSD, technologies exist to uniquely identify structures. 32
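To illustrate why identifiers remove the need for name-based matching, consider the following minimal sketch. It is not a schema of the AIF, the NIST-ISODB, or the CSD: the record type, its field names, the isotherm identifiers, and the ORCID placeholder are all hypothetical. Only the two refcodes and their geometric pore volumes are taken from the MIL-53(Al) discussion above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class IsothermRecord:
    """Hypothetical isotherm metadata; field names are illustrative only."""
    isotherm_id: str
    adsorbate: str
    temperature_k: float
    csd_refcode: Optional[str] = None   # identifier of the structure, if deposited
    author_orcid: Optional[str] = None  # identifier of the author, if provided


def link_by_refcode(
    isotherms: List[IsothermRecord],
    geometric_pore_volume: Dict[str, float],
) -> Tuple[List[Tuple[IsothermRecord, float]], List[IsothermRecord]]:
    """Join isotherms to structures by exact identifier lookup."""
    linked, unlinked = [], []
    for record in isotherms:
        volume = None
        if record.csd_refcode is not None:
            volume = geometric_pore_volume.get(record.csd_refcode)
        if volume is None:
            unlinked.append(record)  # would still require manual curation
        else:
            linked.append((record, volume))
    return linked, unlinked


# Geometric pore volumes (cm^3/g) quoted above for two MIL-53(Al) models.
structures = {"SABWAU01": 0.28, "DOYBEA": 0.55}
records = [
    IsothermRecord("iso-001", "N2", 77.0, csd_refcode="SABWAU01"),
    IsothermRecord("iso-002", "N2", 77.0, csd_refcode="DOYBEA",
                   author_orcid="https://orcid.org/0000-0000-0000-0000"),  # placeholder
    IsothermRecord("iso-003", "Ar", 87.0),  # no identifier reported
]
linked, unlinked = link_by_refcode(records, structures)
```

With identifiers in the metadata, the link reduces to a dictionary lookup; only records without an identifier, such as the third one here, would still require the kind of automated name matching described in this paper.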