Abstract
Predictive chemistry often faces data scarcity, limiting the performance of machine learning (ML) models. This is particularly the case for specialized tasks such as reaction rate or selectivity prediction. A common solution is to use quantum mechanical (QM) descriptors—physically meaningful features derived from electronic structure calculations—to enhance model robustness in low-data regimes. However, computing these descriptors is costly. Surrogate models address this by predicting QM descriptors directly from molecular structure, enabling fast and scalable input generation for data-efficient downstream ML models. In this study, we compare two strategies for using surrogate models: one that feeds predicted QM descriptors into downstream models, and another that leverages the surrogate’s internal hidden representations instead. Across a diverse set of chemical prediction tasks, we find that hidden representations often outperform QM descriptors, particularly when descriptor selection is not tightly aligned with the downstream task. Only for extremely small datasets or when using carefully selected, task-specific descriptors do the predicted values yield better performance. Our findings highlight that the hidden space of surrogate models captures rich, transferable chemical information, offering a robust and efficient alternative to explicit descriptor use. We recommend this strategy for building data-efficient models in predictive chemistry, especially when feature importance analysis is not a primary goal.
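The two strategies compared above can be illustrated with a minimal sketch. It is not the authors' implementation: `ToySurrogate`, its random weights, and the `featurize` helper are hypothetical stand-ins for a trained surrogate model, and the layer sizes are arbitrary assumptions. The sketch only shows where each feature set is read from: the descriptor strategy uses the surrogate's output layer, and the hidden-representation strategy uses the activations just before it.

```python
import math
import random


class ToySurrogate:
    """Hypothetical stand-in for a trained surrogate model.

    Maps a structure-derived input vector to a hidden representation,
    then to predicted QM descriptors. Weights are random here; a real
    surrogate would be trained on electronic structure data.
    """

    def __init__(self, n_in, n_hidden, n_desc, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.gauss(0, 1) for _ in range(n_in)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.gauss(0, 1) for _ in range(n_hidden)]
                      for _ in range(n_desc)]

    def hidden(self, x):
        # Internal representation: one tanh-activated layer.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.w_hidden]

    def descriptors(self, x):
        # Predicted QM descriptors: linear readout of the hidden layer.
        h = self.hidden(x)
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.w_out]


def featurize(surrogate, x, strategy):
    """Return the downstream model's input features under either strategy."""
    if strategy == "descriptors":
        return surrogate.descriptors(x)
    if strategy == "hidden":
        return surrogate.hidden(x)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Assumed sizes: 8 input features, 16 hidden units, 3 predicted descriptors.
surrogate = ToySurrogate(n_in=8, n_hidden=16, n_desc=3)
x = [0.1 * i for i in range(8)]

desc_features = featurize(surrogate, x, "descriptors")
hidden_features = featurize(surrogate, x, "hidden")
print(len(desc_features), len(hidden_features))
```

Either feature vector would then be passed to a downstream model trained on the small task-specific dataset. The hidden vector is typically higher-dimensional than the descriptor vector, which is consistent with the abstract's observation that predicted descriptors can still be preferable for extremely small datasets.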
Supplementary materials
Title
Supporting Information: Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations
Description
Supporting Information for the main text "Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations"
Supplementary weblinks
Title
GitHub Repository Hidden_vs_Desc
Description
GitHub Repository for "Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations"