Designing solvent systems in chemical processes using self-evolving solubility databases and graph neural networks

Yeonjoon Kim; Hojin Jung; Sabari Kumar; Robert S. Paton; Seonah Kim

doi:10.26434/chemrxiv-2022-sq34x-v3

07 July 2023, Version 3

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Designing solvent systems is key to the facile synthesis and separation of desired products in chemical processes, and many machine learning models have been developed to predict solubilities. However, these models still fall short in predictive accuracy and generalizability; expanding and integrating experimental and computational solubility databases can address both deficiencies. To maximize predictive accuracy, models should not be trained on the two databases separately, nor should the databases simply be combined without reconciling their different magnitudes of error and uncertainty. Here, we introduce self-evolving solubility databases and graph neural networks developed through semi-supervised self-training. Solubilities from quantum-mechanical calculations are consulted during semi-supervised learning but are not added directly to the experimental database. The dataset is augmented from 11,637 experimental solubilities to >900,000 data points in the integrated database while correcting for discrepancies between experiment and computation. We applied the model to solvent selection in organic reactions and separation processes. Its accuracy (mean absolute error of about 0.2 kcal/mol on the test set) is sufficient for quantitatively exploring linear free energy relationships between reaction rates and solvation free energies for 11 organic reactions. The model also accurately predicted the partition coefficients of lignin-derived monomers and drug-like molecules. We anticipate that this approach will be attractive to other areas of predictive chemistry where experimental, computational, and other heterogeneous data sources need to be combined.
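
As a rough illustration of how solvation free energies relate to partition coefficients, the sketch below uses the standard thermodynamic relation log10 P = (ΔG_solv,water − ΔG_solv,organic) / (RT ln 10). The abstract does not state how the partition coefficients were obtained, so this route, the function names, and the example free energies are assumptions for illustration only. It also assumes both free energies use the same standard-state convention.

```python
import math

# Molar gas constant in kcal/(mol*K)
R_KCAL = 1.987204e-3


def log_partition_coefficient(dg_solv_water, dg_solv_organic, temperature=298.15):
    """Return log10 P (organic phase / water) for one solute.

    Both arguments are solvation free energies in kcal/mol. A more negative
    value in the organic solvent gives a positive log10 P.
    """
    rt_ln10 = R_KCAL * temperature * math.log(10)
    return (dg_solv_water - dg_solv_organic) / rt_ln10


def log_p_error_from_dg_error(dg_error, temperature=298.15):
    """Convert an error in a single solvation free energy (kcal/mol)
    into the corresponding error in log10 P units."""
    return dg_error / (R_KCAL * temperature * math.log(10))


if __name__ == "__main__":
    # Hypothetical free energies, not taken from the preprint.
    print(log_partition_coefficient(-4.0, -6.0))
    print(log_p_error_from_dg_error(0.2))
```

Under these assumptions, an error of 0.2 kcal/mol in a single solvation free energy corresponds to roughly 0.15 log units in log10 P at 298.15 K.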

Keywords

Solubility

Machine Learning

Semi-supervised Learning

Data Augmentation

Graph Neural Networks

Supplementary materials

Supplementary Information: Detailed information regarding the training results, analysis, and application of the graph neural network models trained via semi-supervised distillation.

Version History

Jul 07, 2023 Version 3

May 08, 2023 Version 2

Oct 25, 2022 Version 1

Version Notes

The main text has been polished, and more results have been added. The supplementary information file has been added to provide more detailed results. The Acknowledgements section has been added.

License

The content is available under CC BY-NC-ND 4.0.

Funding

NSF Extreme Science and Engineering Discovery Environment (XSEDE)

TG-CHE210034

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
