Theoretical and Computational Chemistry

Size-Extensive Molecular Machine Learning with Global Descriptors


Machine learning (ML) models are increasingly used to predict molecular properties in a high-throughput setting at a much lower computational cost than conventional electronic structure calculations. Such ML models require descriptors that encode the molecular structure in a vector. These descriptors are generally designed to respect the symmetries and invariances of the target property. However, size-extensivity is usually not guaranteed for so-called global descriptors. In this contribution, we show how extensivity can be built into ML models with global descriptors such as the Many-Body Tensor Representation. Properties of extensive and non-extensive models for the atomization energy are systematically explored by training on small molecules and testing on small, medium and large molecules. Our results show that the non-extensive model is only useful in the size range of its training set, whereas the extensive models provide reasonable predictions across large size differences. Remaining sources of error for the extensive models are discussed.
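To make the notion of size-extensivity concrete, the following minimal sketch checks whether a predicted energy for two non-interacting fragments equals the sum of the fragment predictions. It is an illustration only and not the method of this work: plain element counts stand in for a global descriptor, the linear model and its weights are hypothetical, and the "non-extensive" variant simply normalizes the descriptor by the number of atoms.

```python
from collections import Counter

ELEMENTS = ("H", "C", "O")

def sum_descriptor(atoms):
    """Toy global descriptor: element counts, i.e. a sum over atoms."""
    counts = Counter(atoms)
    return [float(counts[e]) for e in ELEMENTS]

def mean_descriptor(atoms):
    """Same descriptor divided by the atom count, which discards system size."""
    n = len(atoms)
    return [x / n for x in sum_descriptor(atoms)]

def linear_model(weights, descriptor):
    """Linear model: dot product of weights and descriptor."""
    return sum(w * x for w, x in zip(weights, descriptor))

# Hypothetical per-element weights, chosen for illustration only.
WEIGHTS = [-0.5, -1.0, -1.5]

water = ["O", "H", "H"]
methane = ["C", "H", "H", "H", "H"]
combined = water + methane  # two non-interacting fragments treated as one system

# Extensive case: prediction for the whole equals the sum over the parts.
e_sum_parts = (linear_model(WEIGHTS, sum_descriptor(water))
               + linear_model(WEIGHTS, sum_descriptor(methane)))
e_sum_whole = linear_model(WEIGHTS, sum_descriptor(combined))

# Non-extensive case: normalization breaks additivity.
e_mean_parts = (linear_model(WEIGHTS, mean_descriptor(water))
                + linear_model(WEIGHTS, mean_descriptor(methane)))
e_mean_whole = linear_model(WEIGHTS, mean_descriptor(combined))

print("sum descriptor additive: ", abs(e_sum_whole - e_sum_parts) < 1e-12)
print("mean descriptor additive:", abs(e_mean_whole - e_mean_parts) < 1e-12)
```

In this toy setting the summed descriptor yields additive predictions, while the normalized one does not, which mirrors why a model that ignores system size is limited to the size range it was trained on.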

