Abstract
Applying machine learning (ML) models to chemistry-related tasks requires describing molecular structures in a machine-readable way. The nature of these so-called molecular descriptors has a direct and major impact on the performance of ML models, and choosing them remains an open problem in the field. Structural descriptors like SMILES strings or molecular graphs are not size-independent and can be memory intensive. Machine-learned descriptors can be of low dimensionality and constant size but lack physical significance and human interpretability. Sigma profiles, which are unnormalized histograms of the surface charge distributions of solvated molecules, combine physical significance with low dimensionality and size-independence, making them a suitable candidate for a universal molecular descriptor. However, their widespread adoption in ML applications requires open access to sigma profile generation, which is currently not available. This work details the development of open-source software for generating sigma profiles. It also studies how different calculation settings affect the efficacy of the generated sigma profiles at predicting thermophysical material properties when they are used as inputs to a Gaussian Process, which serves as a simple surrogate ML model. We find that a higher level of theory does not translate to more accurate results. We also provide further recommendations for sigma profile calculation and use in ML models.
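To illustrate why a sigma profile has a constant size regardless of the molecule, the following minimal sketch bins the charge densities of surface segments into a fixed grid and accumulates segment areas. It is not the implementation from this work: the function name `sigma_profile`, the area weighting, the range of ±0.025 e/Å², the 50 bins, and the clamping of out-of-range values are all assumptions made for illustration.

```python
def sigma_profile(sigmas, areas, sigma_min=-0.025, sigma_max=0.025, n_bins=50):
    """Unnormalized, area-weighted histogram of segment charge densities.

    sigmas: screening charge density of each surface segment (assumed e/A^2)
    areas:  area of each surface segment (assumed A^2)
    The grid limits and bin count are illustrative defaults, not values
    taken from the paper.
    """
    if len(sigmas) != len(areas):
        raise ValueError("sigmas and areas must have the same length")

    width = (sigma_max - sigma_min) / n_bins
    profile = [0.0] * n_bins
    for sigma, area in zip(sigmas, areas):
        index = int((sigma - sigma_min) / width)
        # Assumption: out-of-range segments are clamped into the end bins
        # so that the total surface area is preserved.
        index = min(max(index, 0), n_bins - 1)
        profile[index] += area
    return profile


# Two hypothetical molecules with different numbers of surface segments
small = sigma_profile([-0.01, 0.0, 0.012], [1.5, 2.0, 0.5])
large = sigma_profile([-0.02, -0.01, 0.0, 0.0, 0.01, 0.02], [1.0] * 6)
print(len(small), len(large))
```

Both outputs have the same length, so either can be passed to a model with a fixed input dimension. Because the histogram is unnormalized, the bin values sum to the total surface area, which keeps information about molecular size in the descriptor.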
Supplementary materials
Title
Open-Source Generation of Sigma Profiles: Impact of Quantum Chemistry and Solvation Treatment on Machine Learning Performance
Description
This document shows additional results for the effect of sigma profile averaging and quantum chemistry. It also includes additional sigma profile information metrics and their correlation with machine learning performance, the effect of segment size on performance, and some comments on reproducibility.
Supplementary weblinks
Title
Sigma Profile Generator
Description
GitHub repository for the software developed in the working paper. The software is fully open-source and allows generating sigma profiles given only a molecular descriptor.