Machine Learning Transition Temperatures from 2D Structure

A priori knowledge of melting and boiling points could expedite the discovery of pharmaceutical, energetic, and energy-harvesting materials. The tools of data science are becoming increasingly important for exploring chemical datasets and predicting material properties. A fundamental part of data-driven modeling is molecular featurization. Herein, we propose a molecular representation with group-constitutive and geometrical descriptors that map to enthalpy and entropy, the two thermodynamic quantities that drive thermal phase transitions. The descriptors are inspired by the linear regression-based quantitative structure-property relationship of Yalkowsky and coworkers known as the Unified Physicochemical Property Estimation Relationships (UPPER). Combined with nonlinear machine learning (specifically, eXtreme Gradient Boosting, or XGBoost), these concise and easy-to-compute descriptors provide an appealing framework for predicting transition enthalpies, entropies, and temperatures across a diverse chemical space. An application to energetic materials shows that UPPER plus XGBoost is predictive despite the relatively modest size of the energetics reference dataset. We also report results on public melting point datasets (OCHEM, Enamine, Bradley, and Bergstrom). The newly proposed representation is determined purely from a SMILES string and thus shows promise for fast and accurate screening of thermodynamic properties.
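The link between enthalpy, entropy, and transition temperature mentioned above follows from the equilibrium condition: at a phase transition the Gibbs free energy change is zero, so the transition temperature is the ratio of the transition enthalpy to the transition entropy. The minimal sketch below illustrates that relation only; it is not the UPPER or XGBoost model itself. The function name `transition_temperature` and its unit conventions are assumptions chosen for illustration, and the numerical values are standard textbook figures for the melting of water ice.

```python
def transition_temperature(delta_h_kj_per_mol: float,
                           delta_s_j_per_mol_k: float) -> float:
    """Return the transition temperature in kelvin.

    At equilibrium, dG = dH - T * dS = 0, hence T = dH / dS.
    The name and unit conventions are illustrative assumptions.
    """
    if delta_s_j_per_mol_k <= 0:
        raise ValueError("transition entropy must be positive")
    # Convert the enthalpy from kJ/mol to J/mol so the units match the entropy.
    return delta_h_kj_per_mol * 1000.0 / delta_s_j_per_mol_k


# Textbook values for the fusion of water ice (illustrative input only).
t_melt = transition_temperature(delta_h_kj_per_mol=6.01,
                                delta_s_j_per_mol_k=22.0)
print(f"Estimated melting point: {t_melt:.1f} K")
```

In a workflow of the kind described here, the two inputs would come from models trained on structure-derived descriptors rather than from tabulated values, and any error in either predicted quantity propagates directly into the temperature estimate.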