Abstract
Predictive models of solubility can accelerate solvent selection for applications ranging from the electrochemical conversion of organics to pharmaceutical drug development. Herein, we report a machine learning (ML) workflow for identifying organic co-solvents that increase the concentration of hydrophobic molecules in aqueous mixtures. This task is of particular interest for the electrocatalytic conversion of biomass and bio-oils into sustainable fuels, which is hindered by the low aqueous solubility of the feedstock. The workflow has two steps. First, we predict the miscibility of potential co-solvents in water and retain only those that are miscible. Second, we rank the remaining co-solvents by their ability to solubilize the molecules of interest. To this end, we train two ML models, one on the AqSolDB dataset to predict aqueous solubility (S) and one on the BigSolDB dataset to predict organic solubility (x). After comparing different ML models and features, we select the Light Gradient Boosting Machine architecture for both the aqueous solubility (test R² = 0.864, RMSE = 0.851 log(S / (mol/dm³))) and organic solubility (test R² = 0.805, RMSE = 0.511 log(x)) predictions. We examine the generalizability of the organic solubility model on unseen solutes both quantitatively and qualitatively. We evaluate the utility of this ML workflow by identifying co-solvents for benzaldehyde and limonene, two hydrophobic molecules that are relevant for sustainable fuel production, and validate our predictions via experimental solubility measurements.
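The two-step screening logic described in the abstract (filter by predicted water miscibility, then rank by predicted solubilizing ability) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `screen_cosolvents`, the two predictor callables, and the `miscibility_threshold` cutoff are all hypothetical, and the toy numbers are placeholders rather than measured or predicted values. In practice the callables would wrap the trained aqueous and organic solubility models.

```python
from typing import Callable, Iterable, List, Tuple


def screen_cosolvents(
    solute: str,
    candidates: Iterable[str],
    predict_log_s_water: Callable[[str], float],
    predict_log_x: Callable[[str, str], float],
    miscibility_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Filter candidate co-solvents by water miscibility, then rank them.

    predict_log_s_water(cosolvent) stands in for an aqueous solubility model;
    predict_log_x(solute, cosolvent) stands in for an organic solubility model.
    The threshold is an assumed cutoff, not a value taken from the paper.
    """
    # Step 1: keep only co-solvents predicted to be miscible with water.
    miscible = [
        c for c in candidates if predict_log_s_water(c) >= miscibility_threshold
    ]
    # Step 2: score each remaining co-solvent by predicted solute solubility.
    scored = [(c, predict_log_x(solute, c)) for c in miscible]
    # Highest predicted solubility first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only (not real data).
toy_log_s = {"solvent_A": 1.2, "solvent_B": -2.5, "solvent_C": 0.4}
toy_log_x = {"solvent_A": -1.0, "solvent_B": -0.2, "solvent_C": -0.6}

ranking = screen_cosolvents(
    "solute_X",
    toy_log_s,
    lambda c: toy_log_s[c],
    lambda s, c: toy_log_x[c],
)
print(ranking)
```

In this toy run, `solvent_B` has the highest placeholder solubility score but is excluded in the first step because it falls below the assumed miscibility cutoff, which illustrates why the filter is applied before the ranking.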
Supplementary materials
Title
Supplementary Information
Description
The supplementary information includes additional details about feature selection, data preprocessing, model training and validation, and experimental solubility estimation. It also contains tables and figures supplementary to the manuscript.