Identifying organic co-solvents via machine learning solubility predictions in organic solvents and water

30 January 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Predictive models of solubility can accelerate solvent selection in applications ranging from electrochemical conversion of organics to pharmaceutical drug development. Herein, we report the development of a machine learning (ML) workflow for identifying organic co-solvents that increase the concentration of hydrophobic molecules in aqueous mixtures. This task is of particular interest for the electrocatalytic conversion of biomass and bio-oils into sustainable fuels, which is hindered by the low aqueous solubility of the feedstock. First, we predict the miscibility of potential co-solvents in water and retain only those that are miscible. Second, we rank the retained co-solvents by their ability to solubilize the molecules of interest. To this end, we train two ML models, one on the AqSolDB dataset and one on the BigSolDB dataset, to predict the aqueous solubility (S) and the organic solubility (x), respectively. After comparing different ML models and features, we select the Light Gradient Boosting Machine model architecture for both the aqueous solubility (test R² = 0.864, RMSE = 0.851 log(S / (mol/dm³))) and the organic solubility (test R² = 0.805, RMSE = 0.511 log(x)) predictions. We examine the generalizability of the organic solubility model on unseen solutes both quantitatively and qualitatively. We evaluate the utility of this ML workflow by identifying co-solvents for benzaldehyde and limonene, two hydrophobic molecules that are relevant for sustainable fuel production, and validate our predictions via experimental solubility measurements.
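The two-step screening described in the abstract (filter co-solvents by predicted water miscibility, then rank the survivors by predicted solubility of the target molecule) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `rank_cosolvents`, the idea of a single `miscibility_threshold` on the predicted log(S), and the toy lookup tables standing in for the two trained models are all assumptions made for illustration, and the numbers are placeholders rather than predictions or measurements.

```python
from typing import Callable, List, Tuple


def rank_cosolvents(
    candidates: List[str],
    predict_log_s_water: Callable[[str], float],
    predict_log_x: Callable[[str, str], float],
    solute: str,
    miscibility_threshold: float,
) -> List[Tuple[str, float]]:
    """Filter candidates by predicted water miscibility, then rank them by
    predicted solubility of the solute (highest first)."""
    # Step 1 (assumed criterion): keep co-solvents whose predicted aqueous
    # log(S) is at or above a user-chosen threshold.
    miscible = [
        c for c in candidates if predict_log_s_water(c) >= miscibility_threshold
    ]
    # Step 2: score each remaining co-solvent by the predicted log(x) of the
    # solute dissolved in it.
    scored = [(c, predict_log_x(solute, c)) for c in miscible]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder values standing in for model outputs (hypothetical, not real data).
toy_log_s = {"solvent_A": 1.0, "solvent_B": -2.0, "solvent_C": 0.5}
toy_log_x = {
    ("solute_X", "solvent_A"): -1.2,
    ("solute_X", "solvent_B"): -0.1,
    ("solute_X", "solvent_C"): -0.4,
}

ranking = rank_cosolvents(
    candidates=list(toy_log_s),
    predict_log_s_water=lambda c: toy_log_s[c],
    predict_log_x=lambda s, c: toy_log_x[(s, c)],
    solute="solute_X",
    miscibility_threshold=0.0,
)
print(ranking)
```

In this toy run, `solvent_B` has the highest placeholder log(x) but is dropped in the first step because its placeholder log(S) falls below the threshold, which is the point of screening for miscibility before ranking.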

Keywords

Machine Learning
Solubility Prediction
Organic Solvents
Co-solvents

Supplementary materials

Supplementary Information
The supplementary information includes additional details about feature selection, data preprocessing, model training and validation, and experimental solubility estimation. It also contains tables and figures supplementary to the manuscript.
