Understanding MOF nucleation from solution with Evolving Graphs

: Metal-Organic Frameworks (MOFs) exhibit attractive characteristics for separations such as remarkable surface area and diverse porosities. However, a mechanistic understanding of their synthesis and scale-up remains underexplored due to the complicated nature of building block interactions. In this work, we investigate the collective assembly of building units that have been experimentally observed to initiate MOF nucleation, using MIL-101(Cr) as a prototypical example. We use large-scale molecular dynamics simulations under a variety of synthesis conditions and mixture compositions. We observe that the choice of solvent (water or DMF), introduction of ions (Na + , F - ) and the relative population of MIL-101(Cr) half-secondary building unit (half-SBU) isomers has a strong influence on the cluster formation process. In more detail, the shape, size, nucleation and growth rates, crystallinity and short and long-range order largely vary depending on the synthesis conditions. We evaluate these properties as they naturally emerge when interpreting self-assembly of MOF nuclei as the time-evolution of an undirected graph. Solution-induced conformational complexity and ionic concentration have a dramatic effect on the morphology of clusters emerging during assembly, such diversity is captured by key features of the graph representation. More precisely, pure solvent leads to rapid formation of a small number of large clusters, while ions result in slower nucleation through smaller clusters in water. Finally, we use Principal Component Analysis (PCA) on graph properties to successfully deconvolute MOF self-assembly into a small number of molecular descriptors, such as the average coordination number between half-SBUs and fractal dimension, which can be followed by time-resolved spectroscopy. We conclude that graph theory can be used to understand complex processes such as MOF nucleation through providing molecular descriptors accessible by both simulation and experiment.


INTRODUCTION
Metal-Organic Frameworks (MOFs) constitute a class of porous materials that thanks to their high porosity and surface area, have ignited interest in a plethora of applications including carbon capture and storage 1, 2 , separations 3 , extraction of water from air 4 , electrodes 5 and drug delivery 6 . Nevertheless, the stability of MOFs is lower than other porous materials and their scale-up is still problematic, thus reducing their applicability 7,8 . The presence of defects in MOFs is known to affect their thermomechanical properties, stability, synthesis costs, and overall suitability for industrial applications 9, 10 . This led recent efforts to understand the detailed mechanisms associated with MOF synthesis in order to regulate the extent of defects 11, 12 . The formation of secondary building units (SBUs) during the early stages of synthesis is crucial in determining the final properties of a MOF. Ferey et al. 13 suggested a synthesis mechanism involving the formation of pre-nucleation building units (PNBUs) and their subsequent nucleation. These are soluble zero-charged species such as the half-SBUs mentioned in this work. More experimental works have identified PNBUs and evaluated their role in the final MOF structure following the approach of synthesis through SBU formation [14][15][16][17][18] . Recently, Liu et al. 19 suggested a three-step MOF nucleation mechanism through a mixed experimental and computational work. They identified metastable structures that recrystallize into the MOF, hence acting as precursors to the nucleation of crystalline MOFs, but could not elucidate the molecular mechanisms governing the process. It has been observed that the composition of the synthesis mixture affects the thermodynamics and kinetics of MOF nucleation and growth 20 . Ions are known to promote MOF crystallinity 21, 22 , while their concentration needs to be finetuned in order to achieve this 23 . At last, we should note that fluoride anions were considered in this work following the original MIL-101(Cr) synthesis 24 but fluoride-free synthetic routes also exist 25, 26 . Also, MOF synthesis optimization is very important in order to manufacture MOFs for different applications [27][28][29] . Additionally, manufacturing cost can be decreased by optimizing the synthetic process (e.g., processing time, pressure, and temperature) 30 . Optimization is often difficult due to the several possibilities of metal-ligand combinations and the large number of phases emerging for each one 31,32 . Consequently, isoreticular modification of a well-known MOF by exchanging ligands while maintaining metal-ligand connec-tivity is ordinarily employed to study new MOFs for a particular application 33 . At last, electronic structure parameters can be optimized through mixing different metals in a MOF 34 . Computational studies on the early stages MOF self-assembly are rather limited 35 . Yoneya et al. 36 studied MOF self-assembly with a focus on optimizing simulation parameters using dummy atoms in implicit solvent. Biswal and Kusalik 37 have studied MOF self-assembly using Langevin dynamics, and their results imply the existence of several local energy minima associated with the process. Wells et al. 38 investigated the early stages of MOF synthesis using Monte Carlo methods and developed an algorithm able to distinguish between different phases with respect to composition. Colon et al. 39 focused on the self-assembly of MOF-5 using enhanced sampling methods, but they did not consider all relevant metastable states as they restrained the endpoints of the biased simulation. Cantu et al. 40 investigated the assembly of MIL-101(Cr) building blocks at the density functional theory level, and identified possible SBU isomers through modelling the kinetics of their formation. Finally, the formation of "metal-free" covalent organic frameworks has been studied using coarse-grained models for building blocks 41,42 . The shape and structure of clusters can be examined as a means to characterize the growth of complex materials. In this respect, the fractal dimension is a well-known descriptor of compactness used to describe the self-assembly of metal-ligand systems 43 . Recently, new synthetic strategies to form self-similar MOFs that exhibit fractal geometry were suggested 44, 45 . Maurer et al. 46 observed that fractal dimension provides useful insight into the spatial arrangement of structures with similar radii of gyration. Tsao et al. 47 used fractal dimension to explore the potential of hydrogen storage inside MOF pores. Goesten et al. 48 investigated MOF growth through characterizing the fractal nature of precursors. At last, Liu et al. 19 linked the shape of clusters with crystallinity during MOF nucleation. Graph theory has been used to model nucleation of metaloxide compounds 49 , or construction of molecular polyhedra 50 . Also, it has been successfully applied to study the properties of metal-ligand systems 51 and MOFs 52-54 , and characterize the dynamic behavior of functional materials 55 . In a previous work, we have demonstrated how the initial population of isomer half-SBUs, choice of solvent, and ionic strength affect the thermodynamics of defect formation based on interactions between couples of half-SBUs 56 . There, we computed free energy landscapes and evaluated the equilibrium population of stable and meta-stable, crystal-like and noncrystal-like dimers formed from half-SBUs. We concluded that in absence of ions, half-SBUs form entropically controlled defects 57 , while ions tend to control the conformational landscape of dimers, by hindering ligand rotation that may lead to defects as a result of − stacking of the phenyl rings. The latter dominate the equilibrium probability of dimers in the absence of ions. Consequently, interactions between ions and half-SBUs increase the equilibrium probability of crystal-like units.
In this work we use large scale atomistic simulations of the early stages of MOF self-assembly from pre-formed half-SBU precursors to evaluate dynamics under various system compositions. We perform an unsupervised clustering analysis of half-SBUs using a graph-based model 58 in order to identify MOF precursors emerging from solution, and calculate the properties of their interconnected structures. Finally, we carry out a principal component analysis (PCA) to identify the properties which largely determine how selfassembly proceeds under various conditions. This way we deconvolute the characterization of the complex MOF selfassembly process by projecting various properties on the low-dimensional space of principal components. This allows us to evaluate various solution compositions and group them based on the similarity of the resulting assembly mechanisms; thus, offering a mechanistic understanding of the early stages of MOF self-assembly.

METHODS
Simulation setup. Molecular Dynamics (MD) simulations were performed in explicit solvent. Water has been represented with the TIP3P model 59 and ions with the OPLS-AA force field 60 . Our analysis is carried out, apart from water, in N,N-dimethylformamide (DMF) using force field parameters compatible with OPLS-AA 60 obtained from the virtualchemistry.org database 61, 62 . The MOF half-SBUs were modelled using force field parameters obtained from a previous work 56 . Force field parameters are also included as Supporting Information (SI), in a text file (SI_FF.txt). A brief discussion of the force field is provided in the SI, section I, p. 2. The leapfrog integrator was used to propagate dynamics of the system with a time step of 2 fs. The LINCS 63 algorithm preserved the distances of bonds involving hydrogen atoms. The cut-off for non-bonded interactions is 10 Å. Long range electrostatics were treated using the Particle-Mesh Ewald (PME) 64 scheme. The Bussi-Donadio-Parrinello thermostat 65 and the Berendsen barostat 66 preserved the temperature and pressure at 493 K and 3,500 bar respectively for an equilibration period of 5 ns. Production molecular dynamics simulations followed using the Parrinello-Rahman Barostat 67 with a relaxation time of 2 ps in water and 4 ps in DMF. Production simulations were carried out for 100 ns.  Distance and adjacency matrices. At first, a (n x n) distance matrix is constructed, where n is the number of half-SBUs. The generic element of the matrix corresponds to the Euclidean distance between the central oxygen atoms of the i th and j th half-SBUs units. This distance was used as the argument of a step function to define the adjacency of two half-SBUs. A cutoff distance of 15 Å is chosen as this value lies between the first and second coordination shells emerging from the calculation of the pair radial distribution function between central oxygen atoms of half-SBUs. This is provided in the SI, section II, p. 3, Figure S2. In the adjacency matrix, the element is equal to the unity if the distance between and is below the cutoff otherwise it is set to zero. Examples of distance and adjacency matrices are discussed in the SI, section II, p. 3. Graph-based clustering. The clustering analysis of half-SBUs is performed by converting molecular structures into lower dimension graph representations. We consider the central oxygen atoms of the building units as the nodes of the graph. Then we connect these nodes advised by the adjacency matrix discussed in the previous paragraph. Then, we identify clusters as connected components in the graph, where the smallest cluster consists of two half-SBUs (dimer). 70 Departing form depth first search (DFS) 70 that can be used to identify clusters, we analyze the properties of clusters as components of a graph; hence enrich the information we have for each cluster. Also, we calculate the mass of each cluster as the sum of the masses of its constituent particles. Furthermore, the local environment of each half-SBU is characterized by enumerating the neighbors of each node. This is a measure of the coordination of half-SBUs in the cluster. Also, we calculate the number of "free" half-SBUs as isolated nodes, without any edges attached. This allows us to further calculate certain properties of the graph, such as the number of connections (average neighborhood degree of each node, named graph degree for simplicity), the extent of interconnected triplets (transitivity) and the number of nodes which connect with similar ones based on their degree (assortativity coefficient). The graph representation was constructed using the NetworkX Python library 58 . Spherical radius and radius of gyration. The spherical radius and the radius of gyration are calculated for each cluster. The first follows from the ideal process of including each cluster in a sphere. The radius of that sphere would then be half the maximum distance between any two metal centers in this cluster. The radius of gyration 71 provides insight into the distribution of mass in complex structures and it is used to calculate the fractal dimension of each cluster. Periodic boundary conditions in three dimensions are appropriately considered in all these calculations. Fractal Dimension. The fractal dimension D ! was computed using a power law approach 72, 73 in the form, Where M is the cluster mass, and R "#$ its mass-weighted radius of gyration. In single structures we consider all atoms within a spherical volume extending from the center of mass of a cluster. For all atoms included into spherical volumes of increasing radii, we calculate the total mass and the mass-weighted radius of gyration. The fractal dimension is then calculated as the angular coefficient of the linear function obtained by fitting logjR "#$ k against log(M). This is repeated for all clusters consisting of at least three half-SBUs. We consider spherical shells ranging from the minimum distance between centers of half-SBUs in the cluster increased by 1 Å, to the maximum distance increased by 5 Å. This allows to have enough atoms to calculate a meaningful radius of gyration for all volumes, with the largest radius containing the entire cluster. A similar analysis is applied to clusters that spontaneously emerge during simulation. We observe the growth of a fractal pattern by plotting the size of clusters against the corresponding radii of gyration. Then, we use the power-law relationship to calculate the fractal dimension of clusters formed along the simulation trajectory. In this case, however, we substitute the total mass M with the number of half-SBUs in the cluster. The non-mass-weighed Rgyr of the cluster is computed from the positions of the centers of mass of the half-SBUs. Diffraction patterns. We calculate the X-Ray powder diffraction (XRD) patterns of clusters emerging throughout the simulation using PyMatGen 74,75 . The wavelength of the Z-Ray source used is equal to Cuk-a radiation, l=1.54184 Å. We consider angles that differ less than 10 -5 radians to have the same intensity. We scale intensities so that the unity is the maximum value and scaled intensities less than 10 -5 were considered to be negligible. Hydrogen atoms are not present in the calculation to be consistent with the pattern of the experimental crystal structure. This analysis is performed for the five largest clusters at every ns of the simulation. Consequently, distance metrics can be used to evaluate the similarity of each structure with the reference 76, 77 . In this effort, we used various distance metrics such as the Euclidean, Hellinger, Cosine, City-Block (Manhattan), c 2 and Canberra distances 78 . Similarity is calculated as the difference of the distance, normalized by its maximum value in all simulations, from the unity 78 . Zero distance means highly similar to MIL-101(Cr) 24 , while distance equal to the unity means patterns are not similar to MIL-101(Cr) 24 . Effect of ions and solvent. Ions (Na + , F -) are added at concentrations of 0.035M in water and 0.075M in DMF. We chose small concentrations as it was suggested that this is the optimal balance between crystallinity and salt precipitation 56 . Furthermore, a different solvent (DMF) can significantly affect the energetics of conformational transformations of the solute 56, 57 . In this context, we can assess the effect of guest molecules on the collective assembly of half-SBUs. Finally, there is experimental evidence that studying the solution in the early stages of assembly can significantly improve our understanding of the mechanism of MOF synthesis 79 . Calculation of rates. We calculate nucleation and growth rate of cluster formation during self-assembly following the approach discussed by Yuhara et al. 80 . In more detail, we calculate rates per unit volume, J. These are computed directly from unbiased simulations by estimating the partial derivative of the number of clusters formed, N(t), with respect to simulation time, t, and then normalizing it by the volume of the simulation box V: In the expression above, the volume of the system shows negligible fluctuations during the simulation, hence it is considered constant (the average volume of the simulation box was used to calculate rates). We use a linear fit of the transient period at the start of the simulation (until the number of clusters reach a maximum) and then we calculate J from the slope. This corresponds to the rate of nucleation after it is normalized by V. Then, we model the decay in the number of clusters by fitting an exponential function. The growth rate is calculated by the exponent normalized by V. We use NumPy 81 for the fitting procedure. At last, we consider fitting errors as the root mean square deviation of the linear fit for the nucleation rate, and the one standard deviation error on the exponent for the growth rate. Principal Component Analysis. We calculate different sets of data which characterize our analysis of the early stages of MOF self-assembly. A dimensionality reduction can be performed while retaining meaningful information by projecting data on principal components that possess most of the variation of the dataset 82 . In this effort, we consider the average values of quantities corresponding to the largest cluster along the trajectory of each simulation. We standardize data to have a mean of zero and standard deviation equal to the unity. This is done to combine data that have different units and magnitude 83 . Then, we identify eigenvectors and eigenvalues of the covariance matrix 83 . At last, we project data on the eigenvectors (principal components) with the highest eigenvalues.

RESULTS
Simulations in pure water. We start our analysis with the system of MLA half-SBUs in pure water. A large cluster is formed leaving no "free" half-SBUs after approximately 70 ns. This cluster is highly ordered and forms pores. The Q interaction between half-SBUs 57 , is prevalent in this case. This conformer is entropically favored in pure solvent (water, DMF) and features p-p stacking interactions. We observe a rapid decrease in the number of clusters in the first 10 ns as almost half-SBUs are already connected with other two neighbors at this time. Also, we monitor the size of the five largest clusters. After approximately 70 ns the clusters start sintering, incorporating into the large cluster during this time. On average, each half-SBU that belongs to this cluster has three other neighboring half-SBUs. We believe that this high degree of interconnectivity is what holds this cluster intact for the rest of the simulation. The five largest clusters formed of MLA half-SBUs in pure water at the very early stages of the simulations are reported in Figure 2 along with their corresponding graph representations. Molecular structures of clusters emerging after 5 ns of the production simulation are provided in the SI, section III, Figures S3-S10, p. 4-8. An initially equiprobable distribution of half-SBUs in water results in the formation of two relatively smaller clusters. Also, these are less ordered than in the previous case where MLB and MLC were absent. Nevertheless, the clusters present higher dimensionality than the cluster emerging from the purely MLA system. The number of clusters is gradually decreased to 2, reaching a plateau after 50 ns. One of the clusters is almost 30% larger than the other (4,380 and 3,420 atoms respectively). The average number of neighbors each half-SBU has is the largest of all cases as it plateaus to a value greater than 5 after 50 ns. Structures emerging from the simulation of assembly in pure water along with the corresponding graph representations after 20 and 100 ns of the production simulations for both "purely MLA" and equiprobable half-SBU distributions are shown in Figure 3. Structures and graph representations of clusters in the rest of the simulations are available in the SI, section III, Figures S12-S14, p. 10-11. Also, we calculate the distribution of cluster sizes during the simulation, and this is provided in the SI, section IV, Figure  S24, p. 21. Furthermore, we used two clusters formed by MLA isomers in pure water after 100 ns and ran a relatively short MD simulation in vacuo. The reason is to evaluate whether clusters of higher dimension can be formed from 2D sheets, as the MIL-101 crystal is a three-dimensional network 84 . A cluster of increased fractal dimension emerged; hence such clusters could possibly develop after longer times of self-assembly. This is further discussed in the SI, section V, p. 22. At last, we note that we shall use the following code for abbreviations throughout the text and the SI. AW: "Purely MLA in pure water", EW: "Equiprobable MLA/B/C in pure water", AWI: "Purely MLA in water with ions", EWI: "Equiprobable MLA/B/C in water with ions", AD: "Purely MLA in pure DMF", ED: "Equiprobable MLA/B/C in pure DMF", ADI: "Purely MLA in DMF with ions", EDI: "Equiprobable MLA/B/C in DMF with ions". Simulations in water with ions. The introduction of ions (Na + , F -) considerably affects the dynamics of assembly. Ions promote numerous small clusters in contrast with pure water. Crystal-like SBUs are formed in the presence of ions; 56 hence ions can help in healing defects during assembly. Also, small clusters of higher dimensions form in this case as it is desirable 84 . At last, we observe small clusters forming during assembly under both half-SBU distributions; hence solute composition does not significantly affect the early stages of self-assembly in water with ions. In further detail, the number of clusters gradually reaches a plateau at 25 ns for MLA half-SBUs, while this happens much faster in the equiprobable half-SBU distribution. There are approximately 13 and 25 clusters in the "purely MLA" and the "equiprobable half-SBUs" cases respectively. The two largest clusters consist of 1,320 and 1,140 atoms when there are purely MLA half-SBUs and 780 and 480 atoms when there is an equiprobable distribution of half-SBUs. The average coordination of half-SBUs is similar, on average, for the largest cluster in the "purely MLA" case in presence of ions and in pure water. Nevertheless, smaller clusters (e.g., the 2 nd , 3 rd, and 4 th largest clusters) show increased coordination in presence of ions. Coordination is lower in the equiprobable half-SBU distribution in presence of ions than in pure water. Consequently, ions tend to decrease the number of half-SBU neighbors when MLB and MLC are also considered. The choice of a small concentration of ions is further validated by results obtained from a larger concentration as discussed in the SI, section VI, p. 23. Simulations in DMF. Solvent effects are investigated through simulation of the early stages of MOF self-assembly in DMF. The system, consisting of purely MLA half-SBUs, rapidly forms a large cluster where the Q configuration prevails, and linear chains are also formed. Pores form in a similar fashion as in water. In the equiprobable half-SBUs case, we see one long cluster forming in contrast with corresponding simulations in water. Consequently, pure DMF leads to a decrease in the dimensionality of the resulting structures. The number of clusters is rapidly decreased when purely MLA half-SBUs are present in DMF. Equiprobable half-SBUs lead to the formation of two large clusters (4,080 and 3,420 atoms respectively) between 30 and 50 ns. The average number of neighbors in the largest cluster plateaus at around 3 when purely MLA half-SBUs are present, while it is slightly higher when MLB and MLC are introduced. Simulations in DMF with ions. Ionic species show a completely different behavior in DMF when compared with water. The "purely MLA" system forms a large, ordered, and high-dimensional cluster compared with the one formed in absence of ions. A similar behavior is observed when ions are added when MLB, MLC are present. The time evolution of the number of clusters shows a similar trend in both half-SBU distributions resulting in 3-4 clusters after 100 ns. In "purely MLA", we observe a slightly higher degree of interconnectivity in the largest cluster as the number of neighbors is close to 3.3 against 3 in the "equiprobable half-SBUs" scenario. The evolution of the largest cluster, represented by a graph, during assembly in water, is shown in Figure 4. The relevant evolution during assembly in all other simulations in water and DMF, with or without ions is provided in the SI, section III, Figures S15-S21, p. 12-18. Additionally, the probability density of different cluster sizes is discussed in the SI, section IV, Figure S24, p. 21. Graph analysis. A graph representation of the system allows us to calculate of certain properties of the molecular network in every simulation. The equiprobable distribution of species results in higher numbers of connections per half-SBU, hence half-SBUs interact more than when MLA is the only species present. Also, the equiprobable distribution of species and the presence of ions leads to higher transitivity values; hence networks are more tightly connected as more triplets of interconnected members exist. In contrast, purely MLA systems have higher assortativity coefficients, except in the presence of ions in DMF. Consequently, MLA leads to assortative mixing where similar molecules (which have the same number of connections on average) are interconnected more. Overall, larger networks, formed when purely MLA isomers are present, feature members that connect to others with whom they share similarities (e. g. of the same number of connections), but they are not as strongly connected as smaller clusters emerging in the equiprobable distribution of species. The latter "communicate" more with others that are not similar; hence they tend to connect to molecules that share different properties in a way that can prove beneficial to heal defects in the longer term. Estimation of rates. Nucleation and growth rates are calculated for all simulations performed. Samples with purely MLA half-SBUs invariably result in faster nucleation and growth than the equiprobable distribution in pure solvent. In presence of ions in water, rates are similar for both distributions. In water, ions tend to significantly slow down nucleation. However, ions result in faster growth in the equiprobable MLA/B/C distribution and slightly slower growth for MLA half-SBUs than in pure water. MLA nucleation is much faster in water than in pure DMF, while the opposite occurs for growth. MLA/B/C also nucleate faster in pure water, while they grow slightly faster in pure DMF. Ions result in similar nucleation rates for MLA in water and DMF, while in presence of ions MLA/B/C nucleate slightly faster in DMF than in water. At last, ions induce faster growth in water than in DMF. Details on the values of nucleation and growth rates along with an example calculation are provided in the SI, section VII, Figures S27-28, Table S1, p. 24-25. Fractal growth process. We calculate the radii of gyration of clusters (see Figure 4) formed and growing in time during the simulation. We then relate these values with their corresponding size expressed in terms of number of half-SBUs. In this manner, we can calculate the fractal dimension associated with the growth process (see Figure 5). In pure water, we observe that the "purely MLA" system evolves by forming two-dimensional structures as the corresponding fractal dimension is close to 2. Introduction of MLB and MLC leads to a slightly higher fractal dimension, while ions result in an overall decrease of the fractal dimension. On the other hand, in DMF, similar fractal dimension values are obtained for both solute compositions.
However, these are appreciably lower than the ones in pure water; hence DMF decreases the fractal dimension related with self-assembly. Spectator ions in DMF further induce a slight decrease in the fractal dimension related with growth. Diffraction pattern. In this section, we assess structural similarity based on the diffraction patterns of clusters formed during the simulation and that of MIL-101(Cr) 24 (reference structure). We perform this analysis both on clusters formed at the end of our simulations as well as the 5 largest clusters emerging every nanosecond during the whole simulation trajectory. Clusters formed out of MLA units are more similar to the reference based on cosine similarity. We note that the cosine similarity differentiates between the clusters emerging over time under various system compositions and it is widely used to compare spectra. 85-87 Nevertheless, other similarity metrics have been assessed and they are provided in the SI, section VIII, p. 26-32. These do not show considerable differences between the various systems; hence they will not be discussed in this text. This is more pronounced in the case of pure water. Consequently, MLA promotes the crystallinity of clusters formed during assembly in pure water. Also, we see that in pure water some degree of crystallinity is observed for only the larger clusters identified in the simulation, while for a completely different system composition (MLA/B/C in DMF with ions), smaller clusters with some degree of crystallinity are formed. Principal Component Analysis. Descriptors of the nucleation process, extracted from simulations, can be projected on principal components, and help identify new collective variables (named PCs). The 1 st PC is calculated from pro-jecting the data to the 1 st principal component, i.e., the eigenvector with the largest eigenvalue, the 2 nd PC from the 2 nd principal component, i.e., corresponding to the 2 nd largest eigenvalue, etc. The nucleation rate and fractal dimension related to growth are correlated with both PCs. The average graph degree (tied to the local coordination environment) is mostly correlated with the 2 nd PC. Cosine similarity (a descriptor of crystallinity), cluster size, spherical radius, radius of gyration, assortativity coefficient, and transitivity are predominantly correlated with the 1 st PC. In summary, the 1 st PC describes long-range order as it is strongly correlated with the similarity between the XRD patterns of emerging clusters and MIL-101(Cr). The 2 nd PC describes the local coordination environment as it is strongly correlated with the average graph degree. At last, the 3 rd PC is strongly correlated with the growth rate. Systems projected on the PCs, correlation coefficients between properties and the PCs are presented in Figure 5. An explanation of how we performed this analysis and PCA at the very early stages of assembly are provided in the SI, section IX, p. 33-35.

CONCLUSIONS AND OUTLOOK
The early stages of MOF self-assembly are investigated through molecular simulation that allows us to identify key interactions during nucleation at the molecular level. We assess the effects of building unit distribution, solvent, and spectator ions on MIL-101 self-assembly by 100 nslong simulations of more than 780,000 atoms. Clusters of building units are then characterized as an evolving graph. Our results corroborate previous studies, where the solution composition considerably influences the dynamical rearrangement of building units. 56, 57 Consequently, our conclusions confirm these phenomena at large length scales where the complexity is significantly increased.
This increased complexity can be deconvoluted through graph theory which we use to monitor and characterize growth at the early stages by projecting trajectories of atomic positions on lower-dimension graphs. The graphs are used to interpret how the connectivity, size, and morphology of clusters evolve during assembly. As a result, this work sets the base for further analysis of nucleation using data science to evaluate dynamics as the propagation of a coarse-grained graph model.
We conclude that less than a third of molecular descriptors suffice to account for 90% of the variance of the dataset. Consequently, we successfully identify nucleation descriptors that capture the evolution of both local and extended features. The 1 st principal component is correlated with the cosine similarity that can be calculated from powder diffraction patterns, and measures long-range order. The latter can be monitored by time-resolved diffraction. 24, 35, 43-47, 88, 89 The 2 nd principal component is strongly correlated with the local structural environment of the half-SBU, a variable that is accessible through X-Ray absorption and scattering methods. 35, 88, 89 The 3 rd principal component characterizes the rate of cluster growth that can be experimentally measured. 90, 91 Ultimately, these descriptors can also form a basis for collective variables to simulate nucleation via enhanced sampling. 56, 92 This opens up the exciting possibility to monitor the evolution of these graphs both experimentally and computationally.

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website.
Supplementary information (SI) linked with the main article (PDF) Supplementary information (FF_SI) including force field parameters (Text File)

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENT
The work described in this publication was performed by Pacific Northwest National Laboratory, which is operated by Battelle for the United States Department of Energy under Contract DE-AC05-76RL0180. V.-A. G. and L. K. gratefully acknowledge support from by the U.S. Department of Energy (DOE), Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division, project 72353 (Interfacial Structure and Dynamics in Ion Separations). The authors acknowledge the use of the UCL Grace, Myriad and Kathleen High Performance Computing Facility (Grace@UCL, Myriad@UCL and Kathleen@UCL), and associated support services, in the completion of this work. We are grateful to the UK Materials and Molecular Modelling Hub for computational resources, which is partially funded by EPSRC (EP/P020194/1). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science of the U.S. Department of Energy under Contract No. DE-SUPPLEMENTARY MATERIAL AC02-05CH11231.