Statistical Curation of Thermophysical Data: Resolving Conflicted Activity Coefficients in ILThermo for Machine Learning

12 June 2025, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The accuracy of thermodynamic models for ionic liquid (IL)–solute systems depends fundamentally on the quality of the underlying experimental data. However, widely used databases such as ILThermo frequently contain conflicting measurements for the same system at identical temperature and pressure. These discrepancies often arise from unreported experimental variables (differences in instrumentation, measurement methodology, or sample handling) that are inadequately documented in the metadata. Unlike controlled parameters such as temperature or pressure, these hidden inconsistencies introduce systematic biases that undermine data reliability and distort downstream applications, including separation design, property prediction, and machine learning. To address this issue, we propose a thermodynamically informed and statistically sound framework for identifying and resolving internal data conflicts. The approach combines the Gibbs–Helmholtz equation with the Chow test for structural stability to assess whether regression models fitted to different subsets of experimental data are consistent. Significant differences in the regression coefficients, which reflect enthalpic and entropic behavior, flag inconsistent data subsets for removal. Importantly, the method does not rely on a predetermined reference; instead, it performs thorough pairwise comparisons to identify the most self-consistent subsets. This study focuses on establishing a reproducible and generalizable protocol for curating thermophysical data before any modeling is attempted. As a practical demonstration, we apply the method to activity coefficient data and show how physical consistency checks can markedly improve dataset integrity. The proposed approach provides a scalable framework for refining large experimental datasets, laying a foundation for more dependable thermodynamic analysis, modeling, and machine-learning applications.
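
The pairwise consistency check described in the abstract can be illustrated with a minimal sketch. It assumes, as an illustration rather than the authors' stated implementation, that each data subset is fitted with the linearized form ln(gamma) = a + b/T suggested by the Gibbs–Helmholtz equation (b relating to the enthalpic term, a to the entropic term), and that two subsets are compared with the standard Chow F statistic. The function names and all data values below are hypothetical and synthetic.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, residual sum of squares).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


def chow_statistic(
    inv_t_1: Sequence[float],
    ln_gamma_1: Sequence[float],
    inv_t_2: Sequence[float],
    ln_gamma_2: Sequence[float],
) -> Tuple[float, int, int]:
    """Chow F statistic for two subsets, each fitted as ln(gamma) = a + b*(1/T).

    Returns (F, numerator dof, denominator dof).
    """
    k = 2  # parameters per regression: intercept a and slope b
    n1, n2 = len(inv_t_1), len(inv_t_2)
    _, _, rss_1 = fit_line(inv_t_1, ln_gamma_1)
    _, _, rss_2 = fit_line(inv_t_2, ln_gamma_2)
    _, _, rss_pooled = fit_line(
        list(inv_t_1) + list(inv_t_2), list(ln_gamma_1) + list(ln_gamma_2)
    )
    dof = n1 + n2 - 2 * k
    f_value = ((rss_pooled - (rss_1 + rss_2)) / k) / ((rss_1 + rss_2) / dof)
    return f_value, k, dof


# Synthetic example (assumed values, not taken from ILThermo).
temperatures = [298.15, 308.15, 318.15, 328.15, 338.15]  # kelvin
inv_t = [1.0 / t for t in temperatures]

noise_a = [0.010, -0.010, 0.005, -0.005, 0.000]
noise_b = [-0.004, 0.006, -0.008, 0.003, 0.002]
noise_c = [0.006, -0.003, 0.004, -0.007, 0.001]

# Sources A and B follow the same line; source C has different coefficients.
source_a = [0.5 + 300.0 * x + e for x, e in zip(inv_t, noise_a)]
source_b = [0.5 + 300.0 * x + e for x, e in zip(inv_t, noise_b)]
source_c = [0.9 + 150.0 * x + e for x, e in zip(inv_t, noise_c)]

f_ab, k_ab, dof_ab = chow_statistic(inv_t, source_a, inv_t, source_b)
f_ac, k_ac, dof_ac = chow_statistic(inv_t, source_a, inv_t, source_c)

print("A vs B: F =", f_ab, "with dof", (k_ab, dof_ab))
print("A vs C: F =", f_ac, "with dof", (k_ac, dof_ac))
```

A large F value indicates that the two subsets are poorly described by a single pooled regression. A p-value can be obtained from the F distribution with the returned degrees of freedom, for example with `scipy.stats.f.sf(F, k, dof)`; the significance threshold and the rule for deciding which subset to discard are left open here, since the abstract does not specify them.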

Keywords

Data curation
Thermophysical Data
Statistical Curation
