Abstract
Food, water, air, and soil are regularly contaminated with naturally occurring and artificial forms of arsenic, among which arsonic acid derivatives, RAsO(OH)2, are the major pentavalent compounds present in aqueous media. At a given pH, the ionization state of these derivatives affects their lipophilicity, solubility, protein binding, and ability to cross plasma membranes, potentially increasing their toxicity. Knowing their pKa values not only characterizes them but also helps in designing a specific strategy for their bioremediation. Predicting pKa poses numerous challenges, and existing models are limited to certain chemical spaces. To develop a pKa model for arsonic acids, we compare machine learning (ML) methods based on support vector machines with three DFT-based models: correlation to the maximum surface electrostatic potential (VS,max) at the ωB97XD/cc-pVTZ level of theory; correlation to carboxylate atomic charges in conjunction with a density-based solvation model (SMD) at the M06L/6-311G(d,p) level of theory; and the scaled solvent-accessible surface approach. The scaled solvent-accessible surface approach yielded high mean unsigned errors for the predicted pKa values and is therefore not an efficient method for calculating the pKa values of arsonic acids, in contrast with reported data for carboxylic acids, aliphatic amines, and thiols. The highest agreement was obtained with the atomic charge calculation on the conjugate arsonate base. The ML-based and VS,max models rank second and third, respectively, in terms of prediction performance.
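The correlation-based models mentioned above share a common pattern: a computed descriptor (for example VS,max or an atomic charge) is related to experimental pKa by a linear fit, and the fit is judged by its mean unsigned error. The following is a minimal sketch of that pattern only, not the authors' actual workflow. The function names `fit_line` and `mean_unsigned_error` are hypothetical, and the numbers are arbitrary placeholders, not data from this work.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_unsigned_error(predicted, reference):
    """Average absolute deviation between predicted and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Placeholder values for illustration only (NOT taken from the paper):
# one descriptor value and one reference pKa per hypothetical compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
pka_ref = [9.0, 7.0, 5.0, 3.0]

slope, intercept = fit_line(descriptor, pka_ref)
pka_pred = [slope * d + intercept for d in descriptor]
mue = mean_unsigned_error(pka_pred, pka_ref)
print(f"pKa = {slope:.3f} * descriptor + {intercept:.3f}; MUE = {mue:.3f}")
```

In practice the descriptor values would come from the quantum-chemical calculations and the reference values from experiment. A lower mean unsigned error on compounds held out of the fit would indicate a more predictive model.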
Supplementary materials
Title
Supporting Information
Description
Correlation tables for pKa and VS,max calculations, xyz coordinates for all optimized compounds, and results of the genetic algorithms
Title
ML descriptors
Description
Spreadsheet containing all machine learning descriptors and results, as included in the Supporting Information PDF file