Combining Physics-Based Protein–DNA Energetics with Machine Learning to Predict Interpretable Transcription Factor-DNA Binding

07 May 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Transcription factors (TFs) are essential regulators of gene expression, and variations in their target DNA sequences that alter TF-DNA binding affinity and specificity can lead to diseases ranging from developmental disorders to cancer. Computational methods that integrate physics-based models with machine learning (ML) hold promise for accurately predicting protein–DNA binding affinities while preserving interpretability and generalizability. Here, we present an approach that combines all-atom molecular dynamics (MD) simulations and Molecular Mechanics-Generalized Born Surface Area (MMGBSA) energy calculations with ML models (neural networks, random forests, support vector machines) to predict DNA binding affinities and specificities for the dimeric TF Myc/Max. Using high-quality experimental data from genomic-context protein-binding microarrays (gcPBM), we constructed a balanced dataset of 168 DNA sequences reflecting physiologically relevant genomic environments. Multiple independent simulations were run for each TF-DNA complex to capture its structural dynamics and interaction properties, and physically essential energetic descriptors were extracted, including van der Waals, electrostatic, solvation, hydrogen bonding, and additional energy corrections. Our models achieved a Pearson correlation of ~0.73 and a mean absolute error of 0.4, substantially improving upon conventional MMGBSA prediction. Feature importance analyses highlighted TF-DNA interfacial complementarity and hydrophobic interactions as primary determinants of binding affinity and specificity, although the sequence dependence of TF-DNA interfacial hydrogen bonding contributions remains to be better characterized physically. This physics-informed ML framework thus aims at both predictive accuracy and mechanistic interpretability, paving the way toward universal, scalable prediction of interpretable protein–DNA interactions.
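
The two accuracy metrics quoted in the abstract, Pearson correlation and mean absolute error, can be illustrated with a minimal sketch. The function names and the numbers below are hypothetical and serve only to show how the metrics are computed; they are not the authors' code or data.

```python
import math


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    # Sum of products of deviations (unnormalised covariance) and sums of squares.
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    ss_m = sum((m - mean_m) ** 2 for m in measured)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    if ss_m == 0 or ss_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_m * ss_p)


def mean_absolute_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Hypothetical affinities (arbitrary units), for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

r = pearson_r(measured, predicted)
mae = mean_absolute_error(measured, predicted)
print(f"Pearson r = {r:.3f}, MAE = {mae:.3f}")
```

A high correlation together with a low mean absolute error indicates that predictions track both the ranking and the magnitude of the measured values; either metric alone can be misleading.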

Keywords

MMGBSA
binding affinity
machine learning
free energy

Supplementary materials

Supplementary Material: Supplementary data and figures
