Trainable Data Embeddings Enable Multi-Fidelity Learning

Rick Oerder; Gerrit Schmieden; Jan Hamaekers

doi:10.26434/chemrxiv-2025-vx7nx

Theoretical and Computational Chemistry

01 April 2025, Version 1

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We present an approach for end-to-end training of machine learning models for structure-property modeling on collections of datasets computed with different DFT functionals and basis sets. This approach overcomes the problem of data inconsistencies when training machine learning models on atomistic data. We recast the underlying problem as a multi-task learning scenario and show that conditioning neural network-based models on trainable embedding vectors can effectively account for quantitative differences between methods. This allows joint training on multiple datasets that would otherwise be incompatible, and therefore circumvents the need for recomputation at a unified level of theory. Numerical experiments demonstrate that training on multiple reference methods enables transfer learning between tasks, yielding lower errors than training on each task separately. Furthermore, we show that the approach can be used for multi-fidelity learning, improving data efficiency for the highest fidelity by an order of magnitude. To test scalability, we train a single model on a joint dataset compiled from 10 disjoint subsets of the MultiXC-QM9 dataset generated by different reference methods. Again, we observe transfer learning effects that reduce model errors by a factor of 2 compared to training on each subset alone.
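The idea of conditioning a shared model on a trainable embedding per reference method can be illustrated with a deliberately small toy problem. The sketch below is not the authors' architecture: it assumes a scalar input, a linear shared model, a constant offset between two hypothetical "methods", and plain stochastic gradient descent. The names `train_joint`, `predict`, and `offsets` are introduced here purely for illustration.

```python
import random

def predict(params, x, m):
    """Shared linear model plus a learned, method-specific correction u . e[m]."""
    w, b, u, emb = params
    return w * x + b + sum(ui * ei for ui, ei in zip(u, emb[m]))

def train_joint(data, n_methods, dim=2, lr=0.05, epochs=500, seed=0):
    """Jointly fit shared weights and one trainable embedding per method (toy SGD)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    u = [rng.uniform(-0.5, 0.5) for _ in range(dim)]           # shared readout
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]        # one vector per method
           for _ in range(n_methods)]
    for _ in range(epochs):
        for x, m, y in data:
            # Gradient of the squared error with respect to the prediction.
            g = 2.0 * (predict((w, b, u, emb), x, m) - y)
            w -= lr * g * x
            b -= lr * g
            for k in range(dim):
                u_k = u[k]                      # keep the old value for the embedding update
                u[k] -= lr * g * emb[m][k]
                emb[m][k] -= lr * g * u_k       # only the embedding of method m is updated
    return w, b, u, emb

# Assumed toy data: both "methods" share the trend y = 2x but differ by a constant offset.
offsets = [0.0, 1.0]
rng = random.Random(1)
data = []
for m, off in enumerate(offsets):
    for _ in range(20):                         # disjoint samples per method
        x = rng.uniform(-1.0, 1.0)
        data.append((x, m, 2.0 * x + off))
random.Random(2).shuffle(data)

params = train_joint(data, n_methods=len(offsets))
# Difference between the two method-conditioned predictions at the same input.
gap = predict(params, 0.3, 1) - predict(params, 0.3, 0)
print(f"shared slope: {params[0]:.3f}, learned gap between methods: {gap:.3f}")
```

In this toy setting the slope is learned from all samples, while the embeddings only have to absorb the systematic difference between the two label sources. That is the intuition behind joint training on otherwise incompatible datasets; a realistic model would replace the linear map with a neural network and feed the embedding into its layers.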

Version History

May 19, 2025 Version 2

Apr 01, 2025 Version 1

Metrics

Views: 669

Downloads: 317

License

The content is available under CC BY NC ND 4.0

Funding

Fraunhofer-Gesellschaft: PREPARE 40-08394

Deutsche Forschungsgemeinschaft: CRC 1639 NuMeriQS, project no. 511713970

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
