Combining Physics-Based Protein–DNA Energetics with Machine Learning to Predict Interpretable Transcription Factor-DNA Binding

Carmen Al Masri; Jin Yu

doi:10.26434/chemrxiv-2025-mc5q4

Theoretical and Computational Chemistry

07 May 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Transcription factors (TFs) are essential regulators of gene expression, and variations in their target DNA sequences that alter TF-DNA binding affinity and specificity lead to diseases ranging from developmental disorders to cancer. Computational methods that integrate physics-based models with machine learning (ML) hold promise for accurately predicting protein–DNA binding affinities while ensuring interpretability and generalizability. Here, we present an approach that combines all-atom molecular dynamics (MD) simulations and Molecular Mechanics-Generalized Born Surface Area (MMGBSA) energy calculations with ML models (neural networks, random forests, support vector machines) to predict DNA binding affinities and specificities for the dimeric TF Myc/Max. Using high-quality experimental data from genomic-context protein-binding microarrays (gcPBM), we constructed a balanced dataset of 168 DNA sequences reflecting physiologically relevant genomic environments. Multiple independent simulations were conducted for the TF-DNA complex of each sequence to capture structural dynamics and interaction properties, from which physically essential energetic descriptors were extracted, including van der Waals, electrostatic, solvation, hydrogen bonding, and additional energy corrections. Our models achieved a Pearson correlation of ~0.73 and a mean absolute error of 0.4, substantially improving upon conventional MMGBSA prediction. Feature importance analyses highlighted TF-DNA interfacial complementarity and hydrophobic interactions as primary determinants of binding affinity and specificity, though the sequence dependence of TF-DNA interfacial hydrogen bonding contributions remains to be better characterized physically. This physics-informed ML framework thus aims at both predictive accuracy and mechanistic interpretability, paving the way toward universal, scalable prediction of interpretable protein–DNA interactions.
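The two evaluation metrics quoted in the abstract, Pearson correlation and mean absolute error, can be illustrated with a minimal sketch. This is not the authors' code: the function names and the `measured` and `predicted` values below are hypothetical placeholders chosen only to show the standard definitions, and they do not come from the gcPBM dataset or from any model in the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations.
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical affinities in arbitrary units; NOT data from the paper.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.5, 3.5, 3.5, 5.5]

r = pearson_r(measured, predicted)
mae = mean_absolute_error(measured, predicted)
print(f"Pearson r = {r:.3f}, MAE = {mae:.3f}")
```

The two metrics are complementary: the correlation measures how well predictions track the ranking and trend of the measured affinities, while the mean absolute error measures the typical size of the deviation in the units of the measured quantity.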

Supplementary materials

Supplementary Material: Supplementary data and figures

License

The content is available under CC BY NC ND 4.0

Funding

UC Cancer Research Coordinating Committee: C23CR5636

NSF: DMS1763272

Simons Foundation: 594598

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
