Statistical Curation of Thermophysical Data: Resolving Conflicted Activity Coefficients in ILThermo for Machine Learning

Hedayat Haddadi; Adam Kloskowski

doi:10.26434/chemrxiv-2025-3lmfx-v2

Chemical Engineering and Industrial Chemistry

12 June 2025, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The precision of thermodynamic modeling for ionic liquid (IL)–solute systems is fundamentally reliant on the quality of experimental data. However, prevalent databases such as ILThermo frequently exhibit conflicting measurements for the same systems under identical temperature and pressure conditions. These disparities often arise from unaccounted experimental variables—including variances in instrumentation, measurement methodologies, or sample handling—that are inadequately documented in the metadata. Unlike controlled parameters such as temperature or pressure, these concealed inconsistencies introduce systematic biases that undermine data reliability and skew subsequent applications, including separation design, property prediction, and machine learning implementation. To tackle this issue, we propose a thermodynamically informed and statistically sound framework for identifying and resolving internal data conflicts. The approach synthesizes the Gibbs–Helmholtz equation with the Chow test for structural stability to assess the consistency of regression models across different subsets of experimental data. Significant deviations in regression coefficients (indicative of enthalpic and entropic behaviors) serve as flags for identifying and eliminating inconsistent data subsets. Importantly, this methodology does not rely on a predetermined reference; rather, it undertakes thorough pairwise comparisons to ascertain the most self-consistent subsets. This study focuses on establishing a reproducible and generalizable protocol for curating thermophysical data prior to any modeling efforts. As a practical demonstration, we apply this methodology to activity coefficient data, illustrating how physical consistency assessments can markedly enhance dataset integrity. 
The proposed approach provides a scalable framework for refining extensive experimental datasets, thereby establishing a foundation for more dependable thermodynamic analyses, modeling, and machine-learning applications.
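The consistency check described in the abstract can be illustrated with a minimal sketch. It assumes the linearized Gibbs–Helmholtz form ln γ = a + b/T, where the slope b carries the enthalpic term and the intercept a the entropic term, and applies the standard two-group Chow F-test to compare two subsets measured for the same system. The function names (`fit_rss`, `chow_test`), the synthetic data, and the choice of two regression parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats


def fit_rss(inv_T, ln_gamma):
    """Least-squares fit of ln(gamma) = a + b * (1/T); returns (coef, RSS)."""
    X = np.column_stack([np.ones_like(inv_T), inv_T])
    coef, *_ = np.linalg.lstsq(X, ln_gamma, rcond=None)
    resid = ln_gamma - X @ coef
    return coef, float(resid @ resid)


def chow_test(T1, ln_g1, T2, ln_g2, k=2):
    """Chow F-test: do two subsets share the same (a, b) coefficients?

    k is the number of regression parameters (intercept and slope here).
    Returns (F statistic, p-value); a small p-value flags a structural break.
    """
    x1, y1 = 1.0 / np.asarray(T1, float), np.asarray(ln_g1, float)
    x2, y2 = 1.0 / np.asarray(T2, float), np.asarray(ln_g2, float)
    _, rss1 = fit_rss(x1, y1)
    _, rss2 = fit_rss(x2, y2)
    _, rss_pooled = fit_rss(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    dof = len(x1) + len(x2) - 2 * k
    F = ((rss_pooled - rss1 - rss2) / k) / ((rss1 + rss2) / dof)
    p = float(stats.f.sf(F, k, dof))
    return F, p


# Synthetic (hypothetical) data: five temperatures in kelvin and a small
# deterministic scatter so that the residual sums of squares are nonzero.
T = np.array([298.15, 308.15, 318.15, 328.15, 338.15])
scatter = 0.01 * np.array([1.0, -1.0, 1.0, -1.0, 1.0])
ln_gamma_a = 0.5 + 300.0 / T + scatter   # subset A
ln_gamma_b = 1.0 + 300.0 / T + scatter   # subset B: shifted intercept

# Identical subsets should not be flagged; the shifted subset should be.
F_same, p_same = chow_test(T, ln_gamma_a, T, ln_gamma_a)
F_conflict, p_conflict = chow_test(T, ln_gamma_a, T, ln_gamma_b)
print(f"same source: p = {p_same:.3f}")
print(f"conflicting: p = {p_conflict:.2e}")
```

In a pairwise protocol of the kind the abstract outlines, a test like this would be run for every pair of data subsets reported for one IL–solute system, and subsets that repeatedly produce small p-values against the others would be candidates for removal. The significance threshold and the rule for choosing which subset to keep are not specified here and would need to follow the full manuscript.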

Keywords

Data curation

Thermophysical Data

Statistical Curation

Version History

Jun 12, 2025 Version 2

Apr 08, 2025 Version 1

Version Notes

The manuscript was revised to clarify its scope and novelty. Previously, it may have appeared focused solely on dataset preparation, but its core contribution is a general framework for the statistical curation of thermophysical data, exemplified by activity coefficients from ILThermo.

Methodology Focus: Emphasizes the framework that combines the Gibbs–Helmholtz relation with the Chow test for detecting and resolving inconsistencies.

Enhanced Visualizations: Figures have been improved to clearly display Chow test results, R² distributions, and the conflict resolution protocol.

Clarified Results: The results are now described and explained in a more reader-friendly way.

Improved Title & Abstract: Updated to reflect methodological contributions and broader applicability to machine learning and thermodynamic modeling.

Refined Language: Edited for clarity, precision, and alignment with the scientific message.

These changes aim to correct misinterpretations and highlight the work as a general, scalable approach to improving the integrity of thermophysical datasets for modeling and machine learning.

Metrics

Views: 317

License

The content is available under CC BY 4.0

Funding

Nobelium Joining Gdansk Tech Research Community

Agreement No. DEC-51/2023/IDUB/I.1

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
