Abstract
Efficient evaluation of electron repulsion integrals (ERIs) involving high-angular-momentum Gaussian basis functions is computationally challenging on graphics processing units (GPUs), because traditional recurrence-based integral algorithms generate numerous intermediates, causing significant register pressure and memory bottlenecks. In this Article, we present a high-performance, high-angular-momentum Coulomb-matrix J engine optimized for GPU execution. Our approach introduces a novel GPU-optimized McMurchie-Davidson recurrence algorithm combined with a tailored integral batching scheme, designed to jointly minimize intermediate storage requirements and redundant computation. By partitioning high-angular-momentum ERI classes into several carefully selected sub-batches, our approach moves the associated integral evaluation kernels from the memory-bound to the compute-bound regime, significantly enhancing computational throughput and reducing time to solution. Implemented in the Extreme-scale Electronic Structure System (EXESS), our algorithm achieves individual kernel speedups of up to 9x and improves overall J-matrix formation performance by up to 64% across a range of chemical systems of increasing size, including polyglycine chains, water clusters, and boron nitride crystals, when using the cc-pVQZ quadruple-zeta basis set.
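To give a rough sense of why intermediates grow so quickly with angular momentum, the sketch below counts index combinations in a generic McMurchie-Davidson scheme. It is not the algorithm or batching scheme of the Article. It assumes the textbook formulation, in which the Hermite Coulomb integrals R_{tuv} are needed for all t + u + v <= L and are built from auxiliary integrals R^n_{tuv} with n + t + u + v <= L. The function names `n_hermite` and `n_auxiliary` are hypothetical, and the counts are upper bounds on distinct intermediates, not measured register usage.

```python
from itertools import product
from math import comb


def n_hermite(L):
    """Number of Hermite index triples (t, u, v) with t + u + v <= L."""
    return comb(L + 3, 3)


def n_auxiliary(L):
    """Number of auxiliary quadruples (n, t, u, v) with n + t + u + v <= L."""
    return comb(L + 4, 4)


def n_hermite_brute(L):
    """Brute-force check of n_hermite by explicit enumeration."""
    return sum(
        1 for t, u, v in product(range(L + 1), repeat=3) if t + u + v <= L
    )


if __name__ == "__main__":
    # L is the total angular momentum of the integral class (assumed here
    # to be the sum over all four basis functions).
    for L in (4, 8, 12, 16):
        print(L, n_hermite(L), n_auxiliary(L))
```

Under these assumptions, a class with L = 16 (four g functions, the highest angular momentum in cc-pVQZ for first-row atoms) involves 969 target Hermite components, far more than a GPU thread can hold in registers, which is the kind of pressure that splitting a class into sub-batches is meant to relieve.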
Supplementary materials
Title
Supplementary Material: A GPU accelerated J matrix engine for high angular momentum
Description
The supporting information contains numerical timings for the relevant figures in the Article, as well as xyz files for all systems (water clusters, polyglycine chains, and boron nitride crystals) used for performance benchmarking.
The kernel timings (from Fig. 8) for L >= 7 are presented in Table S1. Where relevant, the speedup at the number of batches with the minimum execution time, relative to the timing without batching, is also reported.
The J formation timings for water clusters, polyglycine chains, and boron nitride crystals (from Fig. 9) are presented in Tables S2, S3, and S4, respectively.