High-Performance, High-Angular-Momentum J Engine on Graphics Processing Units

16 May 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Efficient evaluation of electron repulsion integrals (ERIs) involving high-angular-momentum Gaussian basis functions is computationally challenging on graphics processing units (GPUs), because traditional recurrence-based integral algorithms generate numerous intermediates, causing significant register pressure and memory bottlenecks. In this Article, we present a high-performance, high-angular-momentum Coulomb-matrix (J) engine optimized for GPU execution. Our approach combines a novel GPU-optimized McMurchie-Davidson recurrence algorithm with a tailored integral batching scheme, designed to jointly minimize intermediate storage requirements and redundant computation. By partitioning high-angular-momentum ERI classes into several carefully selected sub-batches, our approach moves the associated integral evaluation kernels from the memory-bound to the compute-bound regime, significantly increasing computational throughput and reducing time to solution. Implemented in the Extreme-scale Electronic Structure System (EXESS), our algorithm achieves individual kernel speedups of up to 9x and improves overall J-matrix formation performance by up to 64% across chemical systems of increasing size, including polyglycine chains, water clusters, and boron nitride crystals, with the cc-pVQZ quadruple-zeta basis set.
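
To give a rough sense of why intermediates become a bottleneck at high angular momentum, the sketch below counts the Hermite Coulomb integrals that appear in the textbook McMurchie-Davidson scheme for a given total angular momentum L. It is only an illustration under stated assumptions: the function names are hypothetical, the counts describe the standard textbook recurrence rather than the GPU-optimized algorithm or the batching scheme of this work, and L = 16 is assumed to correspond to a class built from four g-type shells.

```python
from math import comb


def hermite_targets(L: int) -> int:
    """Count the Hermite Coulomb integrals R_{tuv} with t + u + v <= L."""
    # Number of non-negative integer triples (t, u, v) whose sum is at most L.
    return comb(L + 3, 3)


def hermite_with_auxiliary(L: int) -> int:
    """Count the R^{(n)}_{tuv} with t + u + v + n <= L.

    This is the full set visited by the textbook recurrence, which builds
    the targets (n = 0) from auxiliary integrals of higher order n.
    """
    return comb(L + 4, 4)


if __name__ == "__main__":
    # L = 16 is an assumed example: four g-type shells (l = 4 each).
    for L in (0, 4, 7, 12, 16):
        print(L, hermite_targets(L), hermite_with_auxiliary(L))
```

Under these assumptions, L = 16 gives 969 target integrals and 4845 integrals once the auxiliary index is included. In double precision the 969 targets alone would occupy 1938 32-bit registers, well beyond what a single GPU thread can hold (for example, 255 registers per thread on current NVIDIA hardware), which is consistent with the register pressure described above.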

Keywords

GPU
quantum chemistry
J-engine
ERI
DFT
Hartree-Fock

Supplementary materials

Title
Supplementary Material: A GPU accelerated J matrix engine for high angular momentum
Description
The supporting information contains numeric timings for the relevant figures in the Article, as well as xyz files for all systems (water clusters, polyglycine chains, and boron nitride crystals) used for performance benchmarking. The kernel timings (from Fig. 8) for L >= 7 are presented in Table S1. Where relevant, the speedup of the batch count with the minimum execution time over the unbatched timing is also reported. The J formation timings for water clusters, glycine chains, and boron nitride crystals (from Fig. 9) are presented in Tables S2, S3, and S4, respectively.
