Holistic Prediction of pKa in Diverse Solvents Based on a Machine Learning Approach

The acid dissociation constant pKa dictates a molecule’s ionic status and is a critical physicochemical property for rationalizing acid-base chemistry in solution and in many biological contexts. Although numerous theoretical approaches have been developed for predicting aqueous pKa, fast and accurate prediction of non-aqueous pKa values has remained a major challenge. On the basis of the iBonD experimental pKa database, curated across 39 solvents, a holistic pKa prediction model was established using a machine learning approach. Combined structural and physical organic parameter descriptors (SPOC) were introduced to represent the electronic and structural features of molecules. With SPOC and ionic status labelling (ISL), the holistic models trained with a neural network or the XGBoost algorithm showed the best prediction performance, with a mean absolute error (MAE) as low as 0.87 pKa units. The holistic model outperformed all the tested single-solvent models (SSMs), which verifies its transfer learning character. The ability to predict pKa in diverse solvents allows a comprehensive mapping of the possible pKa correlations between different solvents. The iBonD holistic model was validated by predicting, with high accuracy, the aqueous pKa and micro-pKa values of pharmaceutical molecules and the pKa values of organocatalysts in DMSO and MeCN. An online prediction platform (http://pka.luoszgroup.com) was built on the current model.
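The MAE quoted above and the idea of a holistic (multi-solvent) input can be illustrated with a minimal sketch. It assumes, purely for illustration, that the solvent is encoded as a one-hot vector appended to a molecular descriptor vector. The function names, the three-solvent vocabulary, and all numbers are hypothetical; they are not taken from the iBonD data, and this is not the authors' implementation of SPOC or ISL.

```python
from typing import List, Sequence

# Hypothetical solvent vocabulary; the model described above covers 39 solvents.
SOLVENTS = ["water", "DMSO", "MeCN"]


def one_hot_solvent(solvent: str, vocabulary: Sequence[str] = SOLVENTS) -> List[float]:
    """Encode a solvent name as a one-hot vector (an assumed encoding)."""
    if solvent not in vocabulary:
        raise ValueError(f"unknown solvent: {solvent!r}")
    return [1.0 if name == solvent else 0.0 for name in vocabulary]


def holistic_features(descriptors: Sequence[float], solvent: str) -> List[float]:
    """Append the solvent encoding to a molecular descriptor vector."""
    return list(descriptors) + one_hot_solvent(solvent)


def mean_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the same units as the inputs (pKa units here)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Made-up numbers, for illustration only.
row = holistic_features([0.12, -0.40], "DMSO")
mae = mean_absolute_error([4.0, 10.0, 12.0], [4.5, 9.0, 12.0])
```

Because each row carries its own solvent encoding, a single regressor could in principle be fit on data from all solvents at once, whereas a single-solvent model would see only the rows for one solvent.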