Identifying Physico-Chemical Laws from the Robotically Collected Data

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models. Abstract A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models.

to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models.

Introduction
Today we experience rapid development of a new field of chemical science -digital molecular technology (DMT). It is evident by the increasing number of publications in which synthetic and computational chemistry, or materials development are mixed with machine learning (ML), robotics and artificial intelligence (AI), for example in Refs. [1][2][3][4][5][6]. DMT is promising material science [24]. Although successfully proven, the proposed SR was based on a heuristic search that could terminate the optimization in local minima solutions, producing potentially less suitable models than possible. Additionally, as the structure of a model reflects the actual mechanistic interactions within the system studied, these approximate solutions cannot be used reliably to infer any mechanistic information about the system, i.e. to use it to identify chemical reactions mechanisms with certainty. Acknowledging this disadvantage, SR is formulated as a mixed-integer nonlinear programming (MINLP) in Ref. [25,26] and solved to global optimum.
However, the method remained in the mathematical domain and has yet to be applied to physical models and noisy experimental data.
This paper aims to advance the method of globally optimal symbolic regression towards automated, data-driven identification of physical models, and its applications to chemical engineering case studies. Compared to additive models in conventional regression and heuristic searches in SR, the globally optimal data-driven modelling technique, without any previously imposed model structure, is expected to discover true underlying relationships more reliably.
To accomplish this, a modified optimization formulation of SR is developed and implemented in combination with a framework for physical model selection. As a proof of concept, several case studies were investigated in the areas of rheology and reaction engineering. The purpose is to illustrate an automated research pipeline deriving interpretable and generalizable models and thereby providing access to physical knowledge from data generated in robotic experiments. Within this big picture, closing the loop of utilizing the obtained physical models 4 in further experimentation and generation of physical knowledge by (automated) interpretation remains for future research, as shown in Figure 1.

MINLP formulation
Use will be made of the Directed Acyclic Graph (DAG) description of algebraic functions throughout this work [27]. The MINLP formulation is based on a balanced binary tree superstructure for the representation of the equations describing a physical model. The overall goal is to enable the assembly of free-form algebraic functions by connecting predictor variables and operators, such that the resulting function predicts the dependent variable values accurately. As an example, the structure with nodes in a four-layer balanced binary tree is shown in Figure 2a. The formulation presented here is based on Ref [26], but follows a different concept in the setup of the binary tree in order to reduce the number of binary variables in the global optimization. These modifications are addressed in Section 2.2.
An expression tree consists of = 2 $% − 1 nodes, where defines the number of layers.
All nodes that have a connection with nodes on a lower level, their children nodes, are called branch-nodes ( + ) or non-leaf nodes and house a mathematical operator. The nodes on the lowest layer in the tree, referred to as leaf-nodes ( . ), are assigned to a predictor variable 0,2 or a constant 4 . In the following sections, we will refer to the total number of activated nodes in the tree as "complexity" of the model (except the ones with an identity operator). Each given data point deployed in the SR is described by two parameters: the value of the predictor variables 0,2 and the dependent variable value 2 , which is predicted by the model for each training data point j. As shown in Figure 2b, the input variables are assigned for selection only at the leaf-nodes, while the dependent variable values are used at the root node ( = 1) for comparison with the model prediction.
Each node has a value for each data point 4,2 , which is computed to be used as operator arguments on the layer above.  Consequently, by using the tree structure and the index assignment given (Tables 1-3), the optimization problem was formulated with the objective to minimize the sum of squared errors (SSE) between the values computed by the model and the experimental data, Eq. 1, according to Ref. [26].
Eqs. 2-4 enable the operator selection at branch-nodes with a big-M approach, as proposed in Refs. [26] and [28]. The idea of the big-M facilitates the transformation of the disjunctive choice between the operators into linear constraints [28]. If no operator is selected, its nodal value is set to zero by constraints 5-6. Additionally, either none or one operator can be selected at the branch-nodes. Hence, the sum of operator binaries must be less or equal to one, which is constrained by Eq. 7.
In contrast to the branch-node values, the values at the leaf-nodes are determined by equality constraints including the binary selection of predictor variables or constants, Eqs. 8-9. Also, Eqs. 10-11 make sure that either no operand, one variable or one constant can be assigned.
Overall, the model should include at least one predictor variable, which is ensured by Eq. 12.
For the purpose of completeness, Eqs. 13-17 depict the bounds on decision variables of the MINLP [26]. > X 4,2 , @4,2 , @4]E,2 Y ≤ 4,2,> :; X1 − 4,> Y , ∈ + , ∈ ℱ, ∈ (2) > X 4,2 , @4,2 , @4]E,2 Y ≥ 4,2,> .9 X1 − 4,> Y , ∈ + , ∈ ℱ, ∈ Due to the binary architecture and the commutative nature of addition and multiplication, the expression tree contains many mathematically invariant models (symmetries). The design of the formulation should therefore impede redundancies. Eqs. 18-19 resemble cuts in the tree such that, if a unary operator is selected, the unused part towards the right children node is set to zero [26]. The Eqs. 20-23 assure that if an operator is selected on a lower layer of the expression tree, there is an operator attached to the parental node [26]. Likewise, it ensures that the children of a node with value zero also have no operator or variables attached.
Additionally, symmetry breaking cuts (SC) to remove redundant solutions, which are caused by the commutative nature of addition and multiplication, were implemented. Eq. 24 is sufficient for one data point j = j′ to impose an order on the values of the children nodes [26].
The symmetry breaking cuts also posed as big-M constraints where M g * is set using interval arithmetic on the bounds of the two children node values [28].
Without a doubt and from experience with big-M formulations in the Mathematical Programming community, this will lead to rather loose lower bounding in the associated Branch-and-Bound (B&B) traditionally used to solve Mixed-Integer Linear (MILP) and Mixed-Integer Nonlinear Programming (MINLP) problems. Indeed, such were the observations reported later in the computational results of this work, and hence the serious limitations that is a challenge for future development of this rigorous methodology.

Details of Modifications of the Formulation
The previously reported MINLP formulation [26] was modified in order to reduce the number of required binary variables. This is expected to advance the overall efficiency in solving the MINLP as less decision variables have to be determined in the global optimization. The main difference is that in the presented formulation the tree is always fully constructed for a given number of pre-defined tree layers NL. The predictor variables and constants can only be assigned to the lowest layer. The inclusion of an identity function allows to pass up the values without any change to a higher layer in the expression trees.
In comparison to that, in Ref. [26] predictor variables and constants are available for selection at every node in the expression tree superseding an identity function. If those leaf-node operands are selected on a higher layer, their children nodes as well as the other subsequent lower levels are discarded. Hence, the tree is not set up necessarily to the maximum of allowed

Solver
For the aims of this work, the choice of solvers is limited to those that can deterministically solve MINLPs to global optimum. According to Ref. [29], the general list of feasible nonconvex global MINLP solvers contains ANTIGONE [30], BARON [31], Couenne [32], LindoGlobal [33], and SCIP [34]. According to the results of the comparative solver study [26]. BARON

Physical Model Selection
The SR is to be performed with experimental systems data and it is to be acknowledged that all measurements have an error. Following a globally optimal approach targeting exclusively model accuracy (Eq. 1), errors are represented in the final model what is also known as overfitting. Hence, methodological measures have to be included to restrict the influence of errors on the final model and assure generalisation capabilities with low errors to unseen data.
In case of SR, the errors in the training data are propagated through the expression tree and the selected operators apply to the data including errors. The limited robustness to noise is especially prevalent among SR due to its maximal flexibility in constructing free-form models [35].
To only extract the relevant terms describing the main signal and to preferably exclude the errors, the complexity of the final model is penalized. Model complexity is restricted to a threshold C. The identity function does not add to the complexity of a model. Consequently, the true complexity must be discounted by nodes with an identity function assigned. This complexity criterion is included as additional constraint in the MINLP (Eq. 25).
By limiting the flexibility allowed, overfitting can be reduced and sparse models found. This also increases interpretability. Furthermore, this constraint filters mathematical invariants including more terms from the search space.
Next, it is proposed to identify a portfolio of the most accurate models with varying complexity C by solving multiple MINLPs in parallel to global optimality. Among the results, one model is selected that is as sparse as possible to allow interpretation and knowledge extraction but is also as complex as necessary to describe the underlying physical system without overfitting.
Hence, the portfolio models are to be compared with regard to validation error. Due to the growing flexibility, the training error is assumed to be the lowest for the model with the highest allowed complexity. Without requiring assumptions about the underlying true model, these can be compared quantitatively by a data set for validation to check for overfitting. With the purpose of also comparing extrapolation capabilities of the models, the validation data set is created by extracting the data points at extrema of the predictor variables. Finally, the model selection can be based on lowest validation error which also determines the required model complexity. The framework is illustrated in Figure 3.

Reaction kinetic experiments
Reaction kinetic data was collected for hydrolysis of 4-nitrophenyl acetate (PNPA) under basic conditions as a case study. For each experimental run, three stock solutions were prepared consisting of PNPA (at the desired concentration) in 3.0 % (v/v) aqueous methanol, 3 mol·L -1 KCl, and an aqueous NaOH solution at a fixed pH. The adopted conditions for each kinetic experiment are provided in the ESI. The 1 mL of each solution was directly mixed in a spectrophotometric agitated disposable cuvette. Absorption spectra (300-500 nm) were collected at fixed time intervals (Agilent, Cary 60). Absorption data at 400 nm were converted to PNP concentration. Calibration was carried out using different aqueous solutions at a known concentration of PNP at the same methanol and KCl concentrations and pH of the tested solutions. PNPA concentrations were calculated as its initial concentration minus the concentration of the formed product according to the literature [36,37], since no by-products formation was reported under the adopted conditions.

Proof of concept
In the first instance, the methodology described in Section 2.1. was applied to data without errors to assess global optimization in SR, gaining a deeper understanding of its performance.
For reasons of simplicity a function with the same structure of Arrhenius law was considered, but without units and physically relevant parameters (Eq. 27). Arrhenius law is applicable in rheology as well as in reaction kinetics.
Ten data points were randomly sampled in the interval (10, 40). The calculations were performed on an Intel® Core™ i5-3337U CPU @ 1.80 GHz processor. A schematic representation of a four-layers tree for Arrhenius law identification is shown in Figure 5.
In the second instance, the function shown in Eq. 29 was adopted to evaluate the impact of the modifications in the MINLP formulation (Section 2.2) in terms of CPU time until convergence.
In comparison to the previously reported formulation, [26] the influence of the number of included operators ( ), data points ( ), predictor variables ( ), and tree layers ( ) is studied. Figure 6 summarises the obtained results.  Figure 6b and compared with the previously reported formulation [26]. For both formulations an increase in CPU time was observed, and in all cases, our adapted formulation showed improved scalability in the number of data points.
As described in Section 2.2, the conceptual difference in the modified formulation allows to reduce the number of binary variables which are required to allocate the predictor variables in the tree. As a result, a difference in performance should be observed when increasing the number of predictor variables. The expected improvements became evident at higher quantities of potential predictors and are shown in Figure 6c.
As the last parameter, the influence of the number of allowed layers in the expression tree was considered. By growing the tree in terms of the number of layers, an exponentially increasing number of nodes is added. Accordingly, the number of variables and constraints increases exponentially. As a result, the increase in CPU time is also exponential, as shown in Figure 6d.
In the case of layer-scalability, the previous MINLP formulation is superior. The new proposed formulation timed out after a few hours for a number of layers greater than four. For the function under study a three-layered tree was sufficient and the ability to discard sub-layers is in favour over constructing the whole tree with identity functions. This advantage might diminish if more complex functions are sought within higher layered trees. Applicable to both formulations, this result confirms the high computational expense of SR due to the combinatorial search space. The exponential scale-up behaviour in tree layers could strongly limit the method's ability to identify more complex models.

Newton's Law of Viscosity
The data collected from a commercially available emulsion sample by means of the automated capillary viscometer were used to identify the simple linear relationship between shear stress ( š ) and shear rate (ṧ) at the wall of the tubing (Eq. 30).
For the parameter identification an expression tree with three layers, including the shear rate (̇š) as the only predictor variable, and ten experimental data points were used. Two data points The obtained portfolio of models, Table 4, initially consisted of five models. As the complexity is constrained by an upper bound (inequality), similar models with the same complexity are identified. These, together with invariant models at higher complexities, were neglected. To choose the best model among the identified ones, the prediction of the models was plotted together with the experimental data, see Figure 7, and the training and extrapolation errors were  The model identification was conducted with only ten data points, highlighting the sparsity of required data points of the presented method compared to other data-driven methods. This is especially beneficial in chemistry, where data points can be expensive to generate.

Non-Newtonian Power Law
Identification of a non-Newtonian power law was used to prove that the model selection framework favours higher complexity models where required. Eleven experimental data points were collected using 1% w/w aqueous carboxymethyl cellulose at different flow rates. As for the Newton's law identification, an expression three with three layers was used, including the With the afore-mentioned settings, the portfolio of six models, Table 5, was obtained within 13 min. The mathematically invariant and similar models were discarded. It is worth mentioning that the same power law was found for = {5,6,7}. The resulting portfolio consists of three different models of different complexity. experimental data points at the edges of the investigated range of shear rates were taken aside and used to evaluate the extrapolation performance of the obtained models. The three candidate models together with their performance on the training and validation data are shown in Figure   8. In this case, both the training and the extrapolation error decrease with the complexity indicating that the most complex power law is the most appropriate for the description of the experimental data.

First-order Kinetic Law
Previous examples show the potential of the adopted methodology to discover sparse and interpretable models to describe the viscous behaviour of different fluids with a limited amount of experimental data. Moreover, a simple procedure was proven to be effective to select the most appropriate model within the obtained portfolio. In the following the same procedure was applied to learn a kinetic model of a simple test reaction for which a large amount of experimental data was available.
According to literature [38], hydrolysis of carboxylic acid esters can be described by first order kinetic law, Eq. 33.
where [ ] is the molar concentration of the investigated compound (4-nitrophenyl acetate) and the kinetic constant ª can be expressed as shown in Eq. 34.
Under the adopted experimental conditions (pH > 10.52) the terms $ and -[ ] ] are negligible and the overall kinetic law is given in Eq. 35.
In the first attempt, experimental data were collected at a fixed pH of 10.52, at different PNPA concentrations. Concentrations vs time data series were pre-processed to obtained an approximation of the reaction rate over time using the centered difference approximation.
For this example, a three-layer tree structure was allowed, including [ ] as the only predictor variable. Due to the relatively low values of the measured reaction rates (10 -7 -10 -9 mol·L -1 ·s -1 ), they were expressed as mmol·L -1 ·h -1 . The model portfolio without doubling is summarized in Table 6. The identified model with the lower extrapolation error is the true underlying first-order kinetic law governing the physics of the chemical system.  According to the examples reported in Sections 3.4, invariant models were obtained for = 4 and 6, and discarded. The obtained portfolio of models is summarized in Table 7. As shown, both errors are of 2 or 3 orders of magnitude higher when = 3, whereas similar training errors were obtained when = 5 and = 7. However, the lowest extrapolation error suggests that the model with complexity = 5 is the most suitable one for the description of the kinetic behaviour of the system.

Conclusions
Based on a MINLP formulation for global SR reported in the mathematical domain, a different approach in setting up the superstructure was introduced to reduce the number of binary variables involved in globally optimal SR. In addition, this formulation is complemented with a framework to enable the automated identification of physical models from crude data.
The new approach was found to outperform the previously proposed ones in term of computational time when increasing the number of included operators, predictor variable and experimental data. As examples, the developed method allowed to correctly identify the models underlying the rheological behaviour of Newtonian and non-Newtonian fluids, as well as simple kinetic laws, also in the case of sparse data sets, which is a common scenario in chemical process development.
A significant limitation of the methodology was found in the exponential scale-up of the computational time for (a) an increasing number of adopted layers in the tree necessary to represent complex algebraic structures of analytical function type models, and (b) an increasing number of data points.
This serious issue of computational efficiency enhancement cannot be resolved by parallelization. This, at present, limits the identification of more complex models for which a larger number of algebraic operators is often needed. Work is currently underway on alternative approaches using rigorous mathematical programming, such as the one presented in this work, as well as complementary methodologies to derive globally optimal fitted model structures.

Supplementary material
Identifying physico-chemical laws from robotically collected data Instruments modules and is then averaged to reduce random errors. For calibration purposes, the Hagen-Poiseuille law for Newtonian liquids and the empirical viscosity models for the viscosity of glycerol-water mixtures at varying weight contents were implemented. Moreover, the functionalities to measure the viscosity at different shear rates automatically and the drying procedure after the cleaning with isopropanol were designated in the interface [1,2]. Figure S1. LabView front Panel -Graphical User Interface for the Automated Operation of the capillary viscometer.
Before the whole set-up was calibrated with the glycerol-water mixtures, the pressure sensor (Elveflow MPS3) was calibrated in combination with the NI-9702 module. This was necessary as the supplier calibration is based on their own Elveflow amplification and data acquisition system which was not available for the set-up. The sensor was calibrated with a static air pressure by closing the sensor outlet and applying a known pressure with the DRUCK digital pressure indicator. The voltage signals were then acquired in LabView for one to two minutes for each constant pressure within the interval (0,2] bar with steps of 0.1 bar. Based on the arithmetic mean for each, the calibration curve was plotted in Figure S2. The resulting linear fit was implemented in the LabView program, to convert the voltage signals into measured pressure. Figure S2. Calibration of the Pressure Sensor Elveflow MPS3 (0-2bar) with DRUCK DPI 600/IS.

Condition adopted for the kinetic experiments
The initial conditions for all the carried-out kinetic experiments are summarized in Table S1.
All the reaction mixtures contain 1 M KCl and 1% (v/v) MeOH.