Abstract
Aqueous solubility is a key property of a chemical compound that determines its suitability for applications ranging from drug development to materials science. In this work, we present an aqueous solubility prediction study built on a curated dataset merged from four distinct sources. This unified dataset covers a diverse range of organic compounds and provides a robust foundation for investigating solubility prediction. Our approach employs a variety of machine learning and deep learning models that combine an extensive array of chemical descriptors, fingerprints, and functional groups. The methodology is designed to address the complexities of solubility prediction and to achieve high accuracy and generalization. We tested the final model on 1282 unique organic compounds from the Huuskonen dataset. With an R² of 0.92 and an MAE of 0.40, the model outperforms existing prediction methods for aqueous solubility on one of the most diverse datasets in the field.
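For readers less familiar with the two reported metrics, the following is a minimal sketch of how the coefficient of determination and the mean absolute error are commonly computed. It is not the authors' evaluation code: the function names and the sample values are hypothetical, and the sketch assumes solubility is expressed on a logarithmic scale (log S), which the abstract does not state.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left unexplained by the model.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measured values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
mae = mean_absolute_error(measured, predicted)
r2 = r_squared(measured, predicted)
```

Under this definition, an R² close to 1 means the predictions account for most of the variance in the measured values, while the MAE gives the typical size of a prediction error in the same units as the target.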