Identifying organic co-solvents via machine learning solubility predictions in organic solvents and water

Maurycy Krzyzanowski; Sirazam Munira Aishee; Nirala Singh; Bryan R. Goldsmith

doi:10.26434/chemrxiv-2025-xlt1q

Theoretical and Computational Chemistry

30 January 2025, Version 1

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Predictive models of solubility can accelerate solvent selection for applications ranging from the electrochemical conversion of organics to pharmaceutical drug development. Herein, we report a machine learning (ML) workflow for identifying organic co-solvents that increase the concentration of hydrophobic molecules in aqueous mixtures. This task is of particular interest for the electrocatalytic conversion of biomass and bio-oils into sustainable fuels, which is hindered by the low aqueous solubility of the feedstock. The workflow has two steps. First, we predict the miscibility of potential co-solvents in water and retain only those that are miscible. Second, we rank the remaining co-solvents by their ability to solubilize the molecules of interest. To this end, we train two ML models, one on the AqSolDB dataset and one on the BigSolDB dataset, to predict the aqueous solubility (S) and the organic solubility (x), respectively. After comparing different ML models and features, we select the Light Gradient Boosting Machine model architecture for both the aqueous solubility predictions (test R² = 0.864, RMSE = 0.851 log(S / (mol/dm³))) and the organic solubility predictions (test R² = 0.805, RMSE = 0.511 log(x)). We examine the generalizability of the organic solubility model to unseen solutes both quantitatively and qualitatively. We evaluate the utility of this ML workflow by identifying co-solvents for benzaldehyde and limonene, two hydrophobic molecules that are relevant for sustainable fuel production, and we validate our predictions via experimental solubility measurements.
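The two-step screening described in the abstract (filter co-solvents by predicted water miscibility, then rank the survivors by predicted solubility of the target molecule) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `rank_cosolvents`, the two predictor callables, the use of a threshold on predicted aqueous log(S) as the miscibility criterion, and the threshold value are all assumptions made for this example. The names `solute_X` and `solvent_A` to `solvent_C` and their numbers are arbitrary placeholders, not data or model outputs.

```python
from typing import Callable, List, Tuple


def rank_cosolvents(
    solute: str,
    candidates: List[str],
    predict_log_s_water: Callable[[str], float],
    predict_log_x: Callable[[str, str], float],
    miscibility_threshold: float,
) -> List[Tuple[str, float]]:
    """Filter candidates by predicted aqueous solubility, then rank them.

    Assumption: a candidate counts as water-miscible when its predicted
    aqueous log(S) is at or above `miscibility_threshold`. The actual
    miscibility criterion used in the paper may differ.
    """
    # Step 1: keep only co-solvents predicted to be miscible with water.
    miscible = [
        c for c in candidates if predict_log_s_water(c) >= miscibility_threshold
    ]
    # Step 2: score each remaining co-solvent by the predicted solubility
    # of the solute in it, then sort from most to least solubilizing.
    scored = [(c, predict_log_x(solute, c)) for c in miscible]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder lookup tables standing in for the two trained models.
# All names and numbers below are arbitrary and purely illustrative.
toy_log_s = {"solvent_A": 1.0, "solvent_B": -2.5, "solvent_C": 0.8}
toy_log_x = {
    ("solute_X", "solvent_A"): -1.2,
    ("solute_X", "solvent_B"): -0.3,
    ("solute_X", "solvent_C"): -0.7,
}

ranking = rank_cosolvents(
    solute="solute_X",
    candidates=list(toy_log_s),
    predict_log_s_water=lambda solvent: toy_log_s[solvent],
    predict_log_x=lambda solute, solvent: toy_log_x[(solute, solvent)],
    miscibility_threshold=0.0,  # hypothetical cutoff on log(S)
)
print(ranking)
```

In this toy run, `solvent_B` has the highest placeholder log(x) but is dropped in the first step because its placeholder aqueous log(S) falls below the cutoff, which illustrates why the miscibility filter is applied before ranking. In practice the two callables would wrap the trained aqueous and organic solubility models.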

Keywords

Machine Learning

Solubility Prediction

Organic Solvents

Co-solvents

Supplementary materials

Title: Supplementary Information

Description: The supplementary information includes additional details about feature selection, data preprocessing, model training and validation, and experimental solubility estimation. It also contains tables and figures supplementary to the manuscript.

Supplementary weblinks

Title: Source Code

Description: The source code of the ML models is available via GitHub.

Version History

Apr 08, 2025 Version 2

Jan 30, 2025 Version 1

License

The content is available under CC BY 4.0

Funding

Office of Naval Research

N00014-23-1-2439

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
