Abstract
Transcription factors (TFs) are essential regulators of gene expression, and variations in their target DNA sequences that alter TF-DNA binding affinity and specificity can lead to diseases ranging from developmental disorders to cancer. Computational methods that integrate physics-based models with machine learning (ML) hold promise for accurately predicting protein–DNA binding affinities while ensuring interpretability and generalizability. Here, we present an approach that combines all-atom molecular dynamics (MD) simulations and Molecular Mechanics-Generalized Born Surface Area (MMGBSA) energy calculations with ML models (neural networks, random forests, and support vector machines) to predict DNA binding affinities and specificities for the dimeric TF Myc/Max. Using high-quality experimental data from genomic-context protein-binding microarrays (gcPBM), we constructed a balanced dataset of 168 DNA sequences reflecting physiologically relevant genomic environments. For each TF-DNA complex, we ran multiple independent simulations per sequence to capture structural dynamics and interaction properties, and extracted physically essential energetic descriptors, including van der Waals, electrostatic, solvation, and hydrogen-bonding terms as well as additional energy corrections. Our models achieved a Pearson correlation of ~0.73 and a mean absolute error of 0.4, substantially improving upon conventional MMGBSA prediction. Feature importance analyses highlighted TF-DNA interfacial complementarity and hydrophobic interactions as primary determinants of binding affinity and specificity, though the sequence dependence of TF-DNA interfacial hydrogen-bonding contributions remains to be better characterized physically. This physics-informed ML framework thus aims at both predictive accuracy and mechanistic interpretability, paving the way toward universal, scalable, and interpretable prediction of protein–DNA interactions.
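The two evaluation metrics quoted in the abstract, the Pearson correlation and the mean absolute error between predicted and measured binding values, can be illustrated with a minimal sketch. The function names and the numeric values below are hypothetical and chosen only for illustration; they are not data or code from this study, and the units of the values are left unspecified.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Hypothetical binding values for five sequences (NOT data from the study).
measured = [7.1, 7.8, 8.4, 6.9, 8.0]
predicted = [7.4, 7.5, 8.1, 7.3, 8.2]

r = pearson_r(measured, predicted)
mae = mean_absolute_error(measured, predicted)
print(f"Pearson r = {r:.2f}, MAE = {mae:.2f}")
```

The two metrics are complementary: the correlation measures how well the predicted ranking and trend follow experiment, while the mean absolute error measures the typical size of the deviation in the units of the measured quantity.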
Supplementary materials
Supplementary Material: Supplementary data and figures