Toward Predicting Solubility of Arbitrary Solutes in Arbitrary Solvents. 1: Prediction of Density and Refractive Index Using Machine Learning Algorithms with correlation-group parallel feature analysis

01 November 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Density and refractive index (nD) are two important properties related to the van der Waals energy of a molecule. Accurate prediction of these two properties is therefore valuable both for molecular mechanics force field development and for predicting the solvation free energy and solubility of arbitrary molecules. In this study, we gathered molecular characteristics for roughly 5,000 organic compounds with density records and 4,000 organic compounds with nD values. We then generated GAFF (General AMBER Force Field) descriptors and RDKit descriptors for the compounds and used each descriptor set to train prediction models for both properties with a variety of machine learning algorithms. Both GAFF and RDKit descriptors yielded robust models with low average percent errors (APE), low root-mean-square errors (RMSE), and high correlation coefficients (R-square), with RDKit performing slightly better for both properties. We further optimized the top models and conducted parallel feature analysis (PFA) to identify the specific features in each descriptor set that contributed most to model robustness. The final models achieve an RMSE of 0.071 g/cm³ for density and 0.014 for nD, an APE of 2.845% for density and 0.531% for nD, and an R-square of 0.950 for density and 0.954 for nD. Our prediction models for both density and nD significantly outperform those in all currently published studies, especially studies whose datasets contain more than 200 records. The successful prediction of these two key molecular properties paves the way toward accurately predicting the solubility of an arbitrary solute in an arbitrary solvent, an endeavor that would not only help the pharmaceutical industry develop better drug candidates but also increase the efficiency of wet lab work.
Key predictors which contribute most to a specific model or model function were identified using both Shapley analysis and correlation-group parallel feature analysis (CG-PFA).
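
The abstract reports APE, RMSE, and R-square without stating their formulas. The sketch below shows one common way to compute them and is not taken from the paper: it assumes APE is the mean absolute percent error relative to the measured value, and that R-square is the squared Pearson correlation between measured and predicted values (the coefficient of determination is another frequent choice). The function names and the sample densities are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error, in the same units as the property."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def ape(measured, predicted):
    """Average percent error (assumed: mean of |error| / |measured|, in %)."""
    n = len(measured)
    return 100.0 * sum(abs(m - p) / abs(m) for m, p in zip(measured, predicted)) / n


def r_square(measured, predicted):
    """Squared Pearson correlation (assumed definition of R-square)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_m * var_p)


if __name__ == "__main__":
    # Made-up densities in g/cm^3, for illustration only (not from the dataset).
    measured = [0.79, 1.00, 1.26, 0.88]
    predicted = [0.81, 0.98, 1.20, 0.90]
    print(f"RMSE = {rmse(measured, predicted):.3f} g/cm^3")
    print(f"APE  = {ape(measured, predicted):.2f} %")
    print(f"R^2  = {r_square(measured, predicted):.3f}")
```

Because APE divides by the measured value, it is only meaningful for properties that stay well away from zero, which holds for both density and nD.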


Refractive index
Machine learning
Solvation free energy

