Enable seminumerical exact exchange calculation with CUDA by vmitq · Pull Request #183 · wavefunction91/GauXC

vmitq · 2026-03-13T12:51:17Z

This PR implements a new feature: the calculation of the seminumerical exact exchange contribution to the gradient. Currently, the EXX gradient is supported only for CUDA+cuBLAS.

Co-authored-by: Lukas Gergs <lg@terraquantum.swiss>

Copilot

Pull request overview

This PR adds support for computing the seminumerical exact-exchange (EXX) contribution to the nuclear gradient on CUDA device backends, and exposes it through the public XCIntegrator / replicated-integrator APIs.

Changes:

Add eval_exx_grad API plumbing across XCIntegrator and replicated integrator layers.
Add device storage + local-work driver support for EXX-gradient intermediates and reduction.
Extend the standalone driver to load/write/print/compare EXX_GRAD from HDF5 references.

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
tests/standalone_driver.cxx	Loads/writes/prints `EXX_GRAD` reference data and compares norms.
src/xc_integrator/xc_data/device/xc_device_stack_data.hpp	Adds device pointers and interface hooks for EXX-gradient buffers.
src/xc_integrator/xc_data/device/xc_device_stack_data.cxx	Allocates/zeros/retrieves EXX-gradient intermediates on device.
src/xc_integrator/xc_data/device/xc_device_data.hpp	Adds `exx_grad` term tracking and device-data virtual interface for EXX gradient.
src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator.hpp	Declares `eval_exx_grad_` in shell-batched replicated integrator.
src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator_exx_grad.hpp	Adds NYI stub for shell-batched EXX gradient.
src/xc_integrator/replicated/replicated_xc_integrator_impl.cxx	Adds replicated pimpl forwarding function `eval_exx_grad`.
src/xc_integrator/replicated/host/shell_batched_replicated_xc_host_integrator.cxx	Wires in the shell-batched EXX-gradient stub header.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.hpp	Declares `eval_exx_grad_` for reference host integrator.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.cxx	Wires in the reference EXX-gradient stub header.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator_exx_grad.hpp	Adds NYI stub for reference host EXX gradient.
src/xc_integrator/replicated/device/shell_batched_replicated_xc_device_integrator.cxx	Wires in the shell-batched EXX-gradient stub header on device build.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.hpp	Declares EXX-gradient evaluation and local-work helpers.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.cxx	Includes the new EXX-gradient device implementation header.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator_exx_grad.hpp	Implements EXX gradient evaluation workflow on device.
src/xc_integrator/local_work_driver/device/scheme1_magma_base.hpp	Adds EXX-gradient driver API surface for MAGMA scheme (NYI).
src/xc_integrator/local_work_driver/device/scheme1_magma_base.cxx	Adds NYI throws for EXX-gradient operations under MAGMA.
src/xc_integrator/local_work_driver/device/scheme1_base.hpp	Extends scheme1 base interface with EXX-gradient ops.
src/xc_integrator/local_work_driver/device/scheme1_base.cxx	Implements EXX K-derivative accumulation and contraction to basis-function gradients.
src/xc_integrator/local_work_driver/device/local_device_work_driver.hpp	Exposes EXX-gradient operations on the local device work driver.
src/xc_integrator/local_work_driver/device/local_device_work_driver.cxx	Forwards EXX-gradient calls to the device-driver PIMPL.
src/xc_integrator/local_work_driver/device/local_device_work_driver_pimpl.hpp	Adds pure-virtual EXX-gradient hooks for device-driver implementations.
src/xc_integrator/local_work_driver/device/cuda/kernels/cublas_extensions.cu	Adds CUDA kernels for matrix row/column reductions used in EXX gradient assembly.
src/xc_integrator/local_work_driver/device/common/device_blas.hpp	Declares `matrix_reduce_rows/cols` device BLAS helpers.
src/runtime_environment/device/device_runtime_environment.cxx	Adds `DeviceRuntimeEnvironment` constructor taking an explicit byte count.
src/runtime_environment/device/device_runtime_environment_impl.hpp	Implements explicit-size device-buffer allocation constructor.
include/gauxc/xc_integrator/xc_integrator_impl.hpp	Adds `eval_exx_grad` to the virtual integrator interface and public wrapper.
include/gauxc/xc_integrator/replicated/replicated_xc_integrator_impl.hpp	Adds low-level replicated EXX-gradient virtual + public entrypoint.
include/gauxc/xc_integrator/replicated/impl.hpp	Implements `ReplicatedXCIntegrator::eval_exx_grad_` wrapper returning a vector.
include/gauxc/xc_integrator/replicated_xc_integrator.hpp	Plumbs `eval_exx_grad_` through replicated integrator type.
include/gauxc/xc_integrator/impl.hpp	Adds `XCIntegrator::eval_exx_grad` public wrapper.
include/gauxc/xc_integrator.hpp	Adds public `eval_exx_grad` API type and method declaration.
include/gauxc/runtime_environment/decl.hpp	Declares the new explicit-size `DeviceRuntimeEnvironment` constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-06T21:31:26Z

src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator_exx_grad.hpp

+    // Reduce results in device memory
+    this->timer_.time_op("XCIntegrator.Allreduce_EXX_GRAD", [&](){
+      this->reduction_driver_->allreduce_inplace(
+        device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum,


In the device-reduction path, the allreduce length is incorrect: exx_grad_device_data() points to a 3*nbf buffer (x/y/z), but the call reduces only nbf elements. This will leave 2/3 of the EXX gradient unreduced across ranks and can produce wrong gradients. Update the reduction size to 3*nbf (and ensure the reduction driver expects that count).

Suggested change

device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum,

device_data_ptr->exx_grad_device_data(), 3*nbf, ReductionOp::Sum,

Copilot · 2026-04-06T21:31:26Z

src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator_exx_grad.hpp

+  // Sum gradient contribution of basis functions
+  // to nuclei
+  rt.device_backend()->master_queue_synchronize();
+  auto& basis_map   = this->load_balancer_->basis_map();
+for (size_t i = 0; i< basis.nshells(); i++) {
+    const auto [b0, b1] = basis_map.shell_to_ao_range()[i];
+    const auto iCenter = basis_map.shell_to_center()[i];
+    for (size_t bf = b0; bf < b1; bf++) {
+        // factor 2 is for bra-ket integral symmetry
+        EXX_GRAD[3*iCenter  ] += 2*exx_bfgrad[bf  ];
+        EXX_GRAD[3*iCenter+1] += 2*exx_bfgrad[bf+1*nbf];
+        EXX_GRAD[3*iCenter+2] += 2*exx_bfgrad[bf+2*nbf];
+    }
+
+}
+}


eval_exx_grad_ accumulates into the caller-provided EXX_GRAD buffer using += without first zeroing/initializing it. This makes the result dependent on the initial contents of EXX_GRAD and can cause incorrect gradients if the buffer is reused. Consider explicitly zeroing EXX_GRAD[0..3*natoms) at the start of the routine (or switching to assignment) to match the typical output-parameter contract.

Copilot · 2026-04-06T21:31:27Z

tests/standalone_driver.cxx

+      if( integrate_exc_grad ) {
+        if( rks ) {
+          EXX_GRAD = integrator.eval_exx_grad( P, sn_link_settings );
+        }
+        else if( uks ) {
+          std::cout << "Warning: eval_exx_grad + UKS NYI!" << std::endl;
+        }
+        else if( gks ) {
+          std::cout << "Warning: eval_exx_grad + GKS NYI!" << std::endl;
+        }
+        if(!world_rank) {
+          std::cout << "EXX Gradient:" << std::endl;
+          std::cout << std::scientific << std::setprecision(6);
+          for( auto iAt = 0; iAt < mol.size(); ++iAt ) {
+            std::cout << "  "
+                      << std::setw(16) << EXX_GRAD[3*iAt + 0]
+                      << std::setw(16) << EXX_GRAD[3*iAt + 1]
+                      << std::setw(16) << EXX_GRAD[3*iAt + 2]
+                      << std::endl;


If integrate_exc_grad is enabled but the calculation is UKS/GKS, EXX_GRAD is never populated/resized (only a warning is printed) but is still indexed in the printing loop. This can cause out-of-bounds access/crash. Guard the printout behind the same condition that actually computes EXX_GRAD (RKS), or ensure EXX_GRAD is resized/filled with zeros in the NYI branches.

Copilot · 2026-04-06T21:31:27Z

tests/standalone_driver.cxx

+        if(integrate_exc_grad) {
+        double exx_grad_ref_nrm(0.), exx_grad_calc_nrm(0.), exx_grad_diff_nrm(0.);
+        for( auto i = 0; i < 3*mol.size(); ++i ) {
+          const auto ref_val = EXX_GRAD_ref[i];
+          const auto clc_val = EXX_GRAD[i];
+          const auto dif_val = std::abs(ref_val - clc_val);
+          exx_grad_ref_nrm  += ref_val*ref_val;


The EXX gradient norm comparison assumes EXX_GRAD has length 3*mol.size(). If EXX gradient is NYI for the chosen KS scheme (currently UKS/GKS), EXX_GRAD may be empty and this loop will read past the end. Only compute these norms when EXX_GRAD is available (or fill it with zeros of the correct size in the NYI paths).

Copilot · 2026-04-06T21:31:27Z

tests/standalone_driver.cxx

+        if( integrate_exc_grad ) {
+          HighFive::DataSpace grad_space( mol.size(), 3 );
+          dset = file.createDataSet<double>( "/EXX_GRAD", grad_space );
+          dset.write_raw( EXX_GRAD.data() );
+        }


When writing the output HDF5 file, /EXX_GRAD is written whenever integrate_exx && integrate_exc_grad are true, but EXX_GRAD may be unset/empty in NYI cases (UKS/GKS). This can write invalid data or crash. Only create/write /EXX_GRAD when EXX_GRAD has been computed (and has the expected size).

Copilot · 2026-04-06T21:31:27Z

src/xc_integrator/xc_data/device/xc_device_stack_data.cxx

+  allocate_static_data_exx(nbf, nshells, nshell_pairs, nprim_pair_total, max_l);
+
+  if( allocated_terms.exx_grad )
+    GAUXC_GENERIC_EXCEPTION("Attempting to reallocate Stack EXC GRAD");
+


The reallocation guard in allocate_static_data_exx_grad throws "Attempting to reallocate Stack EXC GRAD", but this function is allocating EXX-gradient-related storage. The message is misleading for debugging; update it to reference EXX_GRAD (or EXX grad) instead.

Copilot · 2026-04-06T21:31:28Z

src/xc_integrator/xc_data/device/xc_device_stack_data.hpp

    double* exx_k_device   = nullptr;  ///< EXX K storage (nbf,nbf)
+    double* exx_kx_device   = nullptr;  ///< EXX dK/dx intermediates storage (nbf,nbf)
+    double* exx_ky_device   = nullptr;  ///< EXX dK/dy intermediates storage (nbf,nbf)
+    double* exx_kz_device   = nullptr;  ///< EXX dK/dz intermediates storage (nbf,nbf)
+    double* exx_bfgrad_device = nullptr;  ///< EXX gradient storage (nbf)
    double* acc_scr_device = nullptr;  ///< Accumulaion scratch (1)


The comment for exx_bfgrad_device says "EXX gradient storage (nbf)", but the allocation/usage is 3*nbf (x/y/z components). Update the comment to avoid confusion about the buffer size/layout.

Copilot · 2026-04-06T21:31:28Z

src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator_exx_grad.hpp

+                 int64_t ldp, value_type* EXC_GRAD,
+                 const IntegratorSettingsEXX& settings ) {
+
+  GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" );
+  util::unused(m,n,P,ldp,EXC_GRAD);


This NYI stub is for EXX gradient, but the parameter name (EXC_GRAD) and exception message ("exc_grad") refer to EXC gradient instead. Also settings is unused but not included in util::unused, which can trigger -Wunused-parameter warnings under stricter builds. Rename to EXX_GRAD, update the exception string, and include settings in util::unused.

Suggested change

int64_t ldp, value_type* EXC_GRAD,

const IntegratorSettingsEXX& settings ) {

GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" );

util::unused(m,n,P,ldp,EXC_GRAD);

int64_t ldp, value_type* EXX_GRAD,

const IntegratorSettingsEXX& settings ) {

GAUXC_GENERIC_EXCEPTION("HostReplicated exx_grad NYI" );

util::unused(m,n,P,ldp,EXX_GRAD,settings);

Copilot · 2026-04-06T21:31:28Z

include/gauxc/xc_integrator/xc_integrator_impl.hpp

  /** Integrate Exact Exchange for RHF
   *
   *  @param[in] P The alpha density matrix
   *  @returns Excact Exchange Matrix
   */
  exx_type eval_exx( const MatrixType& P, const IntegratorSettingsEXX& settings ) {
    return eval_exx_(P,settings);
  }

+  /** Integrate Exact Exchange nuclear
+   *  derivatives for RHF
+   *
+   *  @param[in] P The alpha density matrix
+   *  @returns Excact Exchange Matrix
+   */
+  exx_grad_type eval_exx_grad( const MatrixType& P, const IntegratorSettingsEXX& settings ) {
+    return eval_exx_grad_(P,settings);
+  }


The new eval_exx_grad Doxygen block has typos and an incorrect return description (it says "Excact Exchange Matrix" but the function returns an EXX gradient vector). Please fix the spelling and update the @returns text to describe the gradient.

Copilot · 2026-04-06T21:31:28Z

src/runtime_environment/device/device_runtime_environment_impl.hpp

+  explicit DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(MPI_Comm c,)
+    size_t nbytes) :
+    DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(c,) nullptr, 0) {
+
+    std::tie( device_memory_, device_memory_size_ ) =
+      device_backend_->allocate_device_buffer(nbytes);
+


DeviceRuntimeEnvironmentImpl passes a size_t nbytes to DeviceBackend::allocate_device_buffer(int64_t). On platforms where size_t is wider than int64_t, large values can overflow/truncate silently. Consider validating nbytes <= std::numeric_limits<int64_t>::max() (and/or changing the backend API to take size_t) before calling into the backend.

Add seminumerical exact exchange for CUDA

96eea8c

Co-authored-by: Lukas Gergs <lg@terraquantum.swiss>

vmitq mentioned this pull request Mar 13, 2026

Add seminumerical exact exchange (EXX) gradient for CUDA #184

Open

awvwgk linked an issue Apr 1, 2026 that may be closed by this pull request

Add seminumerical exact exchange (EXX) gradient for CUDA #184

Open

wavefunction91 requested a review from Copilot April 6, 2026 21:25

Copilot started reviewing on behalf of wavefunction91 April 6, 2026 21:26 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable seminumerical exact exchange calculation with CUDA#183

Enable seminumerical exact exchange calculation with CUDA#183
vmitq wants to merge 1 commit intowavefunction91:masterfrom
vmitq:feature/exx_gradient_cuda

vmitq commented Mar 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Copilot AI Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum,
	device_data_ptr->exx_grad_device_data(), 3*nbf, ReductionOp::Sum,

Conversation

vmitq commented Mar 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants