Fair Benchmarking of Group Contribution and Machine Learning Models for Property Prediction: A New Data Splitting Strategy

16 January 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Accurate prediction of thermophysical properties is important in chemical engineering, where group-contribution (GC) models have been used extensively. Traditional GC models tend to use all available data for parameter estimation, which prevents a fair comparison with machine learning (ML) methods that require separate training, validation, and test data. In this study, we introduce a new data splitting algorithm that optimally partitions molecular datasets by ensuring comprehensive group representation in the training set and, through Butina clustering, representative chemical diversity. We extensively tested this data splitting algorithm on GC methods for predicting critical properties (critical temperature, critical pressure, and critical volume) and acentric factors across 739 organic compounds. We benchmarked the GC models against a dozen ML algorithms, including graph neural network (GNN) models. The results demonstrate that training GC models in the traditional way, on all available data, leads to overestimated performance. GNNs consistently outperform the other methods on the external test dataset, achieving lower errors than both traditional and ML-enhanced GC methods. This work establishes a fair benchmarking standard for comparing GC and ML-based property prediction models, facilitating a more reliable assessment of new methods.
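The abstract describes the splitting idea only at a high level, so the following is a minimal sketch of one way such a split could work, not the authors' algorithm. It assumes each molecule is already described by a fingerprint (a set of on-bits) and by the set of GC groups it contains. The function names, the simplified Butina variant (neighbour counts are recomputed after each cluster is removed), the 0.4 distance cutoff, the rule that cluster centroids stay in training, and the toy data are all illustrative assumptions.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fps: List[Set[int]], cutoff: float = 0.4) -> List[List[int]]:
    """Simplified Butina clustering; each cluster lists its centroid first."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if j != i and 1.0 - tanimoto(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid: the unassigned molecule with the most unassigned
        # neighbours (ties broken by lowest index).
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = [centre] + sorted(neighbours[centre] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


def group_aware_split(
    groups: List[Set[str]], clusters: List[List[int]], test_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Split so that every GC group occurs in at least one training molecule."""
    n = len(groups)
    all_groups = set().union(*groups)
    count = {g: sum(g in s for s in groups) for g in all_groups}

    # Step 1 (assumed rule): starting with the rarest groups, pin one
    # molecule per uncovered group to the training set.
    forced_train: Set[int] = set()
    covered: Set[str] = set()
    for g in sorted(all_groups, key=lambda g: (count[g], g)):
        if g not in covered:
            i = next(i for i in range(n) if g in groups[i])
            forced_train.add(i)
            covered |= groups[i]

    # Step 2 (assumed rule): draw test molecules cluster by cluster so the
    # test set mirrors the cluster structure; centroids stay in training.
    test: Set[int] = set()
    for members in clusters:
        eligible = [i for i in members[1:] if i not in forced_train]
        k = min(int(test_fraction * len(members) + 0.5), len(eligible))
        test.update(eligible[len(eligible) - k:])

    train = set(range(n)) - test
    return sorted(train), sorted(test)


# Toy data (hypothetical): six molecules with fingerprints and GC groups.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {10, 11}, {10, 11, 12}, {20}]
groups = [
    {"CH3", "CH2"}, {"CH3", "CH2", "OH"}, {"CH3", "OH"},
    {"ACH"}, {"ACH", "ACCH3"}, {"COOH"},
]
clusters = butina_clusters(fps, cutoff=0.4)
train, test = group_aware_split(groups, clusters, test_fraction=0.34)
print("clusters:", clusters)
print("train:", train, "test:", test)
```

Because molecules that carry a rare group are pinned to training, the test set can end up smaller than the requested fraction. That trade-off follows from requiring that every GC parameter can be fitted from training data alone. In practice the fingerprints and the clustering would more likely come from a cheminformatics toolkit than from this hand-written version.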

Keywords

property prediction
critical point properties
thermodynamics
graph neural networks
group-contribution
machine learning

Supplementary materials

Title: supplementary material
Description: an overview of the hyperparameter space for the various machine learning models
