P2MAT: A machine learning (ML) driven software for Property Prediction of MATerial

16 December 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Accurate prediction of melting points for pure molecules remains a significant challenge in predictive chemistry, with implications across scientific fields including materials science, drug discovery, and separations chemistry. Traditional methods, such as group contribution (GC) techniques, have shown limited success because of the complex relationship between molecular structure and melting point. In this study, we present a data-driven machine learning (ML) approach to predict the melting points of organic compounds, leveraging both 2D and 3D molecular descriptors. Our dataset comprises 19,811 compounds with 2D features and a subset of 4,568 compounds with additional 3D features. We employed feature selection methods, including pair-wise correlation, Boruta, and principal component analysis, to refine our feature set. Various ML models, including linear regression, ensemble-based regression (Random Forest, gradient-boosted regression, Extreme Gradient Boosted Regression), support vector regression, and deep learning, were evaluated for their predictive performance. The Extreme Gradient Boosted Regression (XGBR) model performed best, with a mean absolute error (MAE) of 27.64 K for 2D features and 31.58 K for combined 2D and 3D features. Outlier detection and removal further improved model accuracy. Additionally, SHAP (SHapley Additive exPlanations) analysis provided insights into feature importance, enhancing model interpretability. Our results indicate that ML models, particularly XGBR, can significantly improve melting point predictions, offering a robust tool for the scientific community. Scientific Contribution: The P2MAT application can predict both melting point and boiling point from SMILES strings as input. The GUI is simple and easy to load on the user's system.
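To make two of the steps named in the abstract concrete, the following is a minimal sketch of a pair-wise correlation filter and of the MAE metric, written in plain Python. It is not the P2MAT implementation: the greedy filtering strategy, the 0.95 threshold, the descriptor names (`desc_a`, `desc_b`, `desc_c`), and all numbers are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # constant column: treated here as uncorrelated
    return cov / (std_x * std_y)


def drop_correlated(features, threshold=0.95):
    """Greedy pair-wise correlation filter.

    `features` maps a descriptor name to its values (one per compound).
    A descriptor is kept only if |r| with every already-kept descriptor
    is at most `threshold`. The threshold is an assumed value.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def mean_absolute_error(y_true, y_pred):
    """MAE in the same unit as the targets (K for melting points)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy descriptors: desc_b is almost a linear copy of desc_a.
features = {
    "desc_a": [1.0, 2.0, 3.0, 4.0],
    "desc_b": [2.1, 3.9, 6.2, 8.0],
    "desc_c": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(features)

# Toy melting points in kelvin (made-up values, not from the dataset).
mae = mean_absolute_error([300.0, 350.0, 400.0], [310.0, 340.0, 425.0])
print(selected, mae)
```

In this toy case `desc_b` is removed because it is nearly collinear with `desc_a`, while `desc_c` survives. A real workflow would compute descriptors from SMILES strings and pass the filtered set on to Boruta or PCA and then to the regression models.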

Keywords

QSPR modelling
property prediction
melting point
boiling point
machine learning
linear regression
ensemble
deep learning
explainable machine learning

Supplementary materials

Title: P2MAT: A machine learning (ML) driven software for Property Prediction of MATerial
Description: This document contains additional experimental results and a feature description table.
