Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations

Guanming Chen; Thijs Stuyver

doi:10.26434/chemrxiv-2025-3l5q7

Theoretical and Computational Chemistry


12 June 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Predictive chemistry often faces data scarcity, limiting the performance of machine learning (ML) models. This is particularly the case for specialized tasks such as reaction rate or selectivity prediction. A common solution is to use quantum mechanical (QM) descriptors—physically meaningful features derived from electronic structure calculations—to enhance model robustness in low-data regimes. However, computing these descriptors is costly. Surrogate models address this by predicting QM descriptors directly from molecular structure, enabling fast and scalable input generation for data-efficient downstream ML models. In this study, we compare two strategies for using surrogate models: one that feeds predicted QM descriptors into downstream models, and another that leverages the surrogate’s internal hidden representations instead. Across a diverse set of chemical prediction tasks, we find that hidden representations often outperform QM descriptors, particularly when descriptor selection is not tightly aligned with the downstream task. Only for extremely small datasets or when using carefully selected, task-specific descriptors do the predicted values yield better performance. Our findings highlight that the hidden space of surrogate models captures rich, transferable chemical information, offering a robust and efficient alternative to explicit descriptor use. We recommend this strategy for building data-efficient models in predictive chemistry, especially when feature importance analysis is not a primary goal.
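The two strategies compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `ToySurrogate` class, its random weights (standing in for a surrogate trained on QM data), the toy input vectors, and the k-nearest-neighbour downstream model are all assumptions made for illustration. The sketch only shows where each feature set is taken from: the surrogate's output layer (predicted descriptors) or its internal hidden layer (hidden representation).

```python
import math
import random


class ToySurrogate:
    """Hypothetical stand-in for a pretrained surrogate model:
    structure features -> hidden layer -> predicted QM descriptors."""

    def __init__(self, n_in, n_hidden, n_desc, seed=0):
        rng = random.Random(seed)
        # Assumption: random weights replace weights learned from QM data.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_desc)]

    def hidden(self, x):
        """Strategy 2 features: the internal hidden representation."""
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]

    def descriptors(self, x):
        """Strategy 1 features: the predicted descriptor values (output layer)."""
        h = self.hidden(x)
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]


def knn_predict(train_feats, train_targets, query, k=1):
    """Downstream model (assumed): k-nearest-neighbour regression."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(train_feats[i], query))
    nearest = order[:k]
    return sum(train_targets[i] for i in nearest) / len(nearest)


surrogate = ToySurrogate(n_in=4, n_hidden=8, n_desc=2)

# Hypothetical structure-derived inputs and a small downstream dataset.
molecules = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
targets = [0.5, 1.2, 0.9]

# Build both feature sets from the same surrogate.
desc_feats = [surrogate.descriptors(m) for m in molecules]
hidden_feats = [surrogate.hidden(m) for m in molecules]

# Predict a downstream target for a query molecule with each strategy.
query = [1, 0, 0, 1]
pred_from_desc = knn_predict(desc_feats, targets, surrogate.descriptors(query))
pred_from_hidden = knn_predict(hidden_feats, targets, surrogate.hidden(query))
```

In this sketch the hidden representation is wider than the descriptor vector (8 versus 2 values), which reflects the general idea that the hidden space may retain information that a small set of chosen descriptors discards. Which feature set performs better on a real task is an empirical question that the toy example does not answer.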

Keywords

Predictive Chemistry

Machine Learning

Surrogate Models

Quantum Chemical Descriptors

Hidden Representations

Supplementary materials

Title: Supporting Information: Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations

Description: Supporting Information of the main text "Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations"

Supplementary weblinks

Title: GitHub Repository Hidden_vs_Desc

Description: GitHub Repository for "Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations"

Metrics

Views: 452

Downloads: 218

License

The content is available under CC BY NC 4.0

Funding

China Scholarship Council, No. 202406020083

Agence Nationale de la Recherche, ANR-22-CPJ1-0093-01

Author’s competing interest statement

The author(s) have declared that they have no conflict of interest with regard to this content.

Ethics

The author(s) have declared that ethics committee/IRB approval is not relevant to this content.
