Fair Benchmarking of Group Contribution and Machine Learning Models for Property Prediction: A New Data Splitting Strategy

Adem Rosenkvist Nielsen Aouichaoui; Jingkang Liang; Jens Abildskov; Gürkan Sin

doi:10.26434/chemrxiv-2025-3fx8d

Chemical Engineering and Industrial Chemistry

16 January 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Accurate prediction of thermophysical properties is important in chemical engineering, where group-contribution (GC) models have been used extensively. Traditional GC-based models tend to use all available data for parameter estimation, which prevents a fair comparison with machine learning (ML) methods that require separate training, validation, and testing data. In this study, we introduce a new data splitting algorithm that optimally partitions molecular datasets by ensuring comprehensive group representation in the training set while preserving representative chemical diversity through Butina clustering. We extensively tested this new data splitting algorithm on GC methods for predicting critical properties (critical temperature, critical pressure, and critical volume) and acentric factors across 739 organic compounds. We benchmarked the GC models against a dozen ML algorithms, including graph neural network (GNN) models. Results demonstrate that traditionally trained GC models lead to performance overestimation. GNNs consistently outperform other methods on the external test dataset, achieving lower errors than both traditional and ML-enhanced GC methods. This work establishes a fair benchmarking standard for comparing GC and ML-based property prediction models, facilitating a more reliable assessment of new methods.
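The splitting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm (the abstract does not give its details); it only assumes that each molecule comes with a set of GC groups and a precomputed cluster label (for example from Butina clustering). The function name `group_aware_split`, the greedy covering step, the per-cluster sampling rule, and the toy data are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def group_aware_split(groups, clusters, test_fraction=0.2, seed=0):
    """Split molecule ids into (train, test) sets.

    groups:   dict mol_id -> set of group names (hypothetical input format)
    clusters: dict mol_id -> integer cluster label, assumed precomputed
    """
    rng = random.Random(seed)
    mols = sorted(groups)

    # Count how many molecules contain each group.
    count = defaultdict(int)
    for m in mols:
        for g in groups[m]:
            count[g] += 1

    # Step 1: lock molecules into training until every group is covered,
    # handling the rarest groups first (a simple greedy cover).
    train, covered = set(), set()
    for g in sorted(count, key=lambda name: (count[name], name)):
        if g in covered:
            continue
        candidates = [m for m in mols if g in groups[m]]
        best = max(candidates, key=lambda m: len(groups[m] - covered))
        train.add(best)
        covered |= groups[best]

    # Step 2: draw test molecules cluster by cluster from the molecules
    # that were not locked, so the test set spans the clusters.
    by_cluster = defaultdict(list)
    for m in mols:
        by_cluster[clusters[m]].append(m)

    test = set()
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        free = [m for m in members if m not in train]
        rng.shuffle(free)
        n_test = min(len(free), int(test_fraction * len(members)))
        test.update(free[:n_test])

    return set(mols) - test, test


# Toy data, invented for illustration only.
toy_groups = {
    "ethane": {"CH3"},
    "propane": {"CH3", "CH2"},
    "butane": {"CH3", "CH2"},
    "pentane": {"CH3", "CH2"},
    "ethanol": {"CH3", "CH2", "OH"},
    "propanol": {"CH3", "CH2", "OH"},
    "acetone": {"CH3", "C=O"},
    "butanone": {"CH3", "CH2", "C=O"},
    "acetic_acid": {"CH3", "COOH"},
    "methylamine": {"CH3", "NH2"},
}
toy_clusters = {
    "ethane": 0, "propane": 0, "butane": 0, "pentane": 0,
    "ethanol": 1, "propanol": 1,
    "acetone": 2, "butanone": 2,
    "acetic_acid": 3, "methylamine": 4,
}

train, test = group_aware_split(toy_groups, toy_clusters, test_fraction=0.5)
train_groups = set().union(*(toy_groups[m] for m in train))
all_groups = set().union(*toy_groups.values())
print(sorted(train), sorted(test), train_groups == all_groups)
```

In this sketch, molecules that carry the only occurrence of a group (here `acetic_acid` and `methylamine`) are forced into the training set, so no GC parameter is left without data, while the remaining molecules are sampled per cluster so that the test set is not dominated by a single chemical family.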

Keywords

property prediction

critical point properties

thermodynamics

graph neural networks

group-contribution

machine learning

Supplementary materials

Title: supplementary material

Description: an overview of the hyperparameter space for the various machine learning models

Supplementary weblinks

Title: GitHub repo

Description: A GitHub repo with the various code and generated results relating to the paper

License

The content is available under CC BY NC ND 4.0

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
