Abstract
Accurate prediction of thermophysical properties is important in chemical engineering, where group-contribution (GC) models have been used extensively. Traditional GC models tend to use all available data for parameter estimation, which prevents a fair comparison with machine learning (ML) methods that require separate training, validation, and test data. In this study, we introduce a new data splitting algorithm that partitions molecular datasets so that every group is represented in the training set, while Butina clustering keeps the partitions representative of the chemical diversity of the dataset. We tested this data splitting algorithm extensively on GC methods for predicting critical properties (critical temperature, critical pressure, and critical volume) and acentric factors across 739 organic compounds. We benchmarked the GC models against a dozen ML algorithms, including graph neural network (GNN) models. The results demonstrate that GC models trained in the traditional way, on all available data, overestimate their own performance. GNNs consistently outperform the other methods on the external test dataset, achieving lower errors than both traditional and ML-enhanced GC methods. This work establishes a fair benchmarking standard for comparing GC and ML-based property prediction models, facilitating a more reliable assessment of new methods.
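To make the splitting idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes each molecule is described by a set of group labels and that a pairwise distance matrix is available (for example, one minus the Tanimoto similarity of fingerprints). The function names `butina_style_clusters` and `group_aware_split`, the toy data, and the cutoff are all hypothetical. The clustering is a simplified Butina-style variant written in pure Python that recomputes neighbour counts after each cluster is removed.

```python
from typing import List, Sequence, Set, Tuple


def butina_style_clusters(dist: Sequence[Sequence[float]], cutoff: float) -> List[List[int]]:
    """Simplified Butina-style clustering on a full symmetric distance matrix.

    Repeatedly picks the unassigned point with the most unassigned neighbours
    within `cutoff` as a centroid and assigns those neighbours to its cluster.
    """
    n = len(dist)
    neighbours = [{j for j in range(n) if dist[i][j] <= cutoff} for i in range(n)]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Ties are broken by the lowest index because max() keeps the first maximum.
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = neighbours[centre] & unassigned
        clusters.append([centre] + sorted(members - {centre}))
        unassigned -= members
    return clusters


def group_aware_split(
    groups: Sequence[Set[str]], clusters: Sequence[Sequence[int]], test_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Split indices so that every group occurs in training (hypothetical helper)."""
    n = len(groups)
    # Step 1: greedily reserve molecules for training until all groups are covered.
    reserved, covered = set(), set()
    for i in sorted(range(n), key=lambda i: -len(groups[i])):
        if not groups[i] <= covered:
            reserved.add(i)
            covered |= groups[i]
    # Step 2: draw test molecules from each cluster in proportion to its size,
    # skipping reserved molecules, so the test set spans several clusters.
    target = round(test_fraction * n)
    test = set()
    for cluster in clusters:
        free = [i for i in cluster if i not in reserved]
        quota = round(test_fraction * len(cluster))
        for i in free[:quota]:
            if len(test) < target:
                test.add(i)
    train = set(range(n)) - test
    return sorted(train), sorted(test)


# Toy data (assumed, for illustration only): 1-D coordinates stand in for fingerprints.
coords = [0.0, 0.1, 0.2, 1.0, 1.1, 2.0]
dist = [[abs(a - b) for b in coords] for a in coords]
groups = [
    {"CH3", "CH2"},
    {"CH3", "CH2"},
    {"CH3", "CH2", "OH"},
    {"CH3", "ACH"},
    {"CH3", "ACH"},
    {"CH3", "COOH"},
]

clusters = butina_style_clusters(dist, cutoff=0.15)
train_idx, test_idx = group_aware_split(groups, clusters, test_fraction=0.33)
print("clusters:", clusters)
print("train:", train_idx, "test:", test_idx)
```

In this sketch, molecules that are the only carriers of a group stay in the training set, so the parameter of every group appearing in a test molecule can still be estimated. A real workflow would likely replace the toy distances with fingerprint-based distances and add a validation partition.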
Supplementary materials
Title
Supplementary material
Description
An overview of the hyperparameter space for the various machine learning models
Supplementary weblinks
Title
GitHub repo
Description
A GitHub repository with the code and generated results relating to the paper