PySIDT: Subgraph Isomorphic Decision Trees for Molecular Property Prediction

Matthew S. Johnson; Hao-Wei Pang; Anna C. Doner; William H. Green; Judit Zador

doi:10.26434/chemrxiv-2024-vbh8g-v2

Theoretical and Computational Chemistry

02 July 2025, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Accurate molecular property prediction is important across all fields of chemistry. Deep neural networks (DNNs) have become increasingly popular because they can be trained automatically, avoiding the tedious process of constructing and extending traditional property estimation schemes. However, DNNs require large amounts of training data, are challenging to interpret, require large amounts of memory to load even during inference, and have severe difficulty incorporating qualitative chemical knowledge, all of which are shortcomings in many molecular property prediction tasks. Here we present PySIDT (https://github.com/zadorlab/PySIDT), a software package for training and running inference on Subgraph Isomorphic Decision Trees (SIDTs). SIDTs are graph-based decision trees made of nodes associated with molecular substructures. Inference is done by descending a target molecular structure down the decision tree to nodes with matching subgraph isomorphic substructures and making the prediction from the final (most specific) node matched. SIDTs scale down well to dataset sizes much smaller than is feasible for DNNs. As trees of molecular substructures, SIDTs are inherently readable and easy to visualize, which makes them easy to analyze. They are also straightforward to extend and retrain, facilitate uncertainty estimation, and enable easy integration of expert knowledge. We demonstrate the SIDT approach by discussing its application to a diverse range of molecular prediction tasks: rate coefficient estimation, diffusion coefficient estimation, thermochemistry estimation, transition state bond stretch prediction, pKa prediction, stability of molecular structures, stability of surface structures, and prediction of surface lateral interaction energetics. Additionally, we demonstrate the power of the SIDT algorithms in two direct learning-curve comparisons of vanilla PySIDT with vanilla Chemprop, a popular DNN-based software package, on enthalpy of formation and rate coefficient prediction tasks. In particular, in the enthalpy of formation case, vanilla PySIDT outperforms vanilla Chemprop across the full range of training/validation set sizes, up to 11,560 data points.
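
The descent described in the abstract can be illustrated with a small, self-contained Python sketch. This is not the PySIDT API: the `Graph`, `Node`, `make_graph`, `matches`, and `predict` names are hypothetical, the matcher is a naive backtracking search over heavy-atom graphs, the "take the first matching child" rule is an assumption made for illustration, and the node values are made-up placeholders rather than fitted properties.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy molecular graph: one label per atom plus undirected bonds."""
    labels: list                                 # e.g. ["C", "C", "O"]; "*" in a pattern matches any atom
    bonds: set = field(default_factory=set)      # {frozenset({i, j}), ...}

    def bonded(self, i, j):
        return frozenset((i, j)) in self.bonds


def make_graph(labels, pairs=()):
    """Hypothetical helper: build a Graph from labels and (i, j) bond pairs."""
    return Graph(list(labels), {frozenset(p) for p in pairs})


def matches(pattern, mol):
    """True if every atom and bond of `pattern` can be mapped injectively onto `mol`."""
    n = len(pattern.labels)

    def extend(mapping):
        k = len(mapping)                         # index of the next pattern atom to place
        if k == n:
            return True
        for cand in range(len(mol.labels)):
            if cand in mapping:
                continue                         # each molecule atom is used at most once
            if pattern.labels[k] not in ("*", mol.labels[cand]):
                continue                         # atom labels must agree (or be a wildcard)
            # every pattern bond from atom k to an already placed atom must exist in mol
            if all(mol.bonded(cand, mapping[j])
                   for j in range(k) if pattern.bonded(k, j)):
                if extend(mapping + [cand]):
                    return True
        return False

    return extend([])


@dataclass
class Node:
    name: str
    pattern: Graph
    value: float                                 # placeholder prediction stored at this node
    children: list = field(default_factory=list)


def predict(root, mol):
    """Descend from the root, moving to the first child whose pattern matches `mol`."""
    node = root
    while True:
        nxt = next((c for c in node.children if matches(c.pattern, mol)), None)
        if nxt is None:
            return node.name, node.value         # most specific node reached
        node = nxt


# A tiny hand-built tree (values are arbitrary placeholders).
o_c_o = Node("O-C-O", make_graph("OCO", [(0, 1), (1, 2)]), 30.0)
c_o = Node("C-O", make_graph("CO", [(0, 1)]), 20.0, [o_c_o])
c_n = Node("C-N", make_graph("CN", [(0, 1)]), 40.0)
root = Node("root", make_graph("*"), 10.0, [c_o, c_n])

ethanol = make_graph("CCO", [(0, 1), (1, 2)])       # heavy atoms only
ethane = make_graph("CC", [(0, 1)])
methanediol = make_graph("OCO", [(0, 1), (1, 2)])

print(predict(root, ethanol))       # ('C-O', 20.0)
print(predict(root, ethane))        # ('root', 10.0)
print(predict(root, methanediol))   # ('O-C-O', 30.0)
```

Ethane matches no child and falls back to the root value, while methanediol descends two levels to the most specific node. A production implementation would use a proper subgraph isomorphism solver and richer atom and bond descriptions than the single-letter labels assumed here.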

Supplementary materials

Supplementary Information: Additional details about all examples and extended discussions of several aspects.

Datasets and Code for Comparisons: Provides the train, validation, test splits and orderings and the code for running the associated comparisons between PySIDT and Chemprop.

Supplementary weblinks

PySIDT: GitHub repository for PySIDT

Version History

Jul 02, 2025 Version 2

Sep 27, 2024 Version 1

Version Notes

We added two learning curve comparisons of PySIDT with Chemprop and some discussion of the solution of subgraph isomorphisms within PySIDT.

Metrics

Views: 577

Downloads: 266

License

The content is available under CC BY NC ND 4.0

DOI

10.26434/chemrxiv-2024-vbh8g-v2

Funding

Basic Energy Sciences: DE-NA0003525

National Energy Research Scientific Computing Center: BES-ERCAP0026789

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
