Advancing Aqueous Solubility Prediction: A Machine Learning Approach for Organic Compounds Using a Curated Dataset

03 December 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Aqueous solubility is a key property of a chemical compound that determines its possible use in applications ranging from drug development to materials science. In this work, we present an aqueous solubility prediction study that leverages a curated dataset merged from four distinct sources. This unified dataset encompasses a diverse range of organic compounds, providing a robust foundation for our investigation of solubility prediction. Our approach employs a variety of machine learning and deep learning models that combine an extensive array of chemical descriptors, fingerprints, and functional groups. This methodology is designed to address the complexities of solubility prediction and is tailored to achieve high accuracy and generalization. We tested the finalized model on a diverse set of 1282 unique organic compounds from the Huuskonen dataset. With an R² value of 0.92 and an MAE value of 0.40 on this test set, one of the most diverse in the field, our model outperforms existing prediction methods for aqueous solubility.
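
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how a coefficient of determination and a mean absolute error are conventionally computed. It is not the authors' evaluation code. The function names and the four measured/predicted values are hypothetical and chosen only for illustration; they are assumed to be solubility values on a logarithmic scale, which the abstract does not state.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from the study).
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print("MAE:", mean_absolute_error(measured, predicted))
print("R2: ", r2_score(measured, predicted))
```

An R² close to 1 means the predictions explain most of the variance in the measured values, while the MAE expresses the typical error in the same units as the target.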

Keywords

Solubility
machine learning
chemical properties
water
