ChemRxiv
These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information.

A Diversified Machine Learning Strategy for Predicting and Understanding Molecular Melting Points

preprint
submitted on 27.09.2019, 17:43 and posted on 30.09.2019, 19:03 by Ganesh Sivaraman, Nicholas Jackson, Benjamin Sanchez-Lengeling, Alvaro Vazquez-Mayagoitia, Alan Aspuru-Guzik, Venkatram Vishwanath, Juan de Pablo

The ability to predict multi-molecule processes using only knowledge of single-molecule structure stands as a grand challenge for molecular modeling. Methods capable of predicting melting points (MP) solely from chemical structure represent a canonical example and are highly desirable in many crucial industrial applications. In this work, we explore a data-driven approach that uses machine learning (ML) techniques to predict and understand the MP of molecules. Several experimental databases are aggregated from the literature to design a low-bias dataset that includes 3D structural and quantum-chemical properties. Using experimental and polymorph-induced uncertainties, we derive a tenable lower limit for MP prediction accuracy, and apply graph neural networks and Gaussian processes to predict MP with accuracy competitive with these error bounds. To further understand how MP correlates with molecular structure, we employ several semi-supervised and unsupervised ML techniques. First, we use unsupervised clustering methods to identify classes of molecules, their common fragments, and expected errors for each dataset. We then build molecular geometric spaces shaped by MP with a semi-supervised variational autoencoder and graph embedding spaces, and apply graph attribution methods to highlight atom-level contributions to MP within the datasets. Overall, this work serves as a case study of how to employ a diversified ML toolkit to predict and understand correlations between molecular structures and thermophysical properties of interest.

Funding

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Argonne National Laboratory's work was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.

Argonne National Laboratory Maria Goeppert Mayer Fellowship

Alan Aspuru-Guzik acknowledges support from the Office of Naval Research under the Vannevar Bush Faculty Fellowship as well as support from the Canada 150 Research Chairs program and Dr. Anders G. Frøseth.

Email Address of Submitting Author

gsivaraman@anl.gov

Institution

ARGONNE NATIONAL LABORATORY

Country

USA

ORCID For Submitting Author

0000-0001-9056-9855

Declaration of Conflict of Interest

There are no conflicts to declare.
