Skip to content

Enable seminumerical exact exchange calculation with CUDA#183

Open
vmitq wants to merge 1 commit intowavefunction91:masterfrom
vmitq:feature/exx_gradient_cuda
Open

Enable seminumerical exact exchange calculation with CUDA#183
vmitq wants to merge 1 commit intowavefunction91:masterfrom
vmitq:feature/exx_gradient_cuda

Conversation

@vmitq
Copy link
Copy Markdown

@vmitq vmitq commented Mar 13, 2026

This PR implements a new feature: the calculation of the seminumerical exact exchange contribution to the gradient. Currently, the EXX gradient is supported only for CUDA+cuBLAS.

Co-authored-by: Lukas Gergs <lg@terraquantum.swiss>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for computing the seminumerical exact-exchange (EXX) contribution to the nuclear gradient on CUDA device backends, and exposes it through the public XCIntegrator / replicated-integrator APIs.

Changes:

  • Add eval_exx_grad API plumbing across XCIntegrator and replicated integrator layers.
  • Add device storage + local-work driver support for EXX-gradient intermediates and reduction.
  • Extend the standalone driver to load/write/print/compare EXX_GRAD from HDF5 references.

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/standalone_driver.cxx Loads/writes/prints EXX_GRAD reference data and compares norms.
src/xc_integrator/xc_data/device/xc_device_stack_data.hpp Adds device pointers and interface hooks for EXX-gradient buffers.
src/xc_integrator/xc_data/device/xc_device_stack_data.cxx Allocates/zeros/retrieves EXX-gradient intermediates on device.
src/xc_integrator/xc_data/device/xc_device_data.hpp Adds exx_grad term tracking and device-data virtual interface for EXX gradient.
src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator.hpp Declares eval_exx_grad_ in shell-batched replicated integrator.
src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator_exx_grad.hpp Adds NYI stub for shell-batched EXX gradient.
src/xc_integrator/replicated/replicated_xc_integrator_impl.cxx Adds replicated pimpl forwarding function eval_exx_grad.
src/xc_integrator/replicated/host/shell_batched_replicated_xc_host_integrator.cxx Wires in the shell-batched EXX-gradient stub header.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.hpp Declares eval_exx_grad_ for reference host integrator.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.cxx Wires in the reference EXX-gradient stub header.
src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator_exx_grad.hpp Adds NYI stub for reference host EXX gradient.
src/xc_integrator/replicated/device/shell_batched_replicated_xc_device_integrator.cxx Wires in the shell-batched EXX-gradient stub header on device build.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.hpp Declares EXX-gradient evaluation and local-work helpers.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.cxx Includes the new EXX-gradient device implementation header.
src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator_exx_grad.hpp Implements EXX gradient evaluation workflow on device.
src/xc_integrator/local_work_driver/device/scheme1_magma_base.hpp Adds EXX-gradient driver API surface for MAGMA scheme (NYI).
src/xc_integrator/local_work_driver/device/scheme1_magma_base.cxx Adds NYI throws for EXX-gradient operations under MAGMA.
src/xc_integrator/local_work_driver/device/scheme1_base.hpp Extends scheme1 base interface with EXX-gradient ops.
src/xc_integrator/local_work_driver/device/scheme1_base.cxx Implements EXX K-derivative accumulation and contraction to basis-function gradients.
src/xc_integrator/local_work_driver/device/local_device_work_driver.hpp Exposes EXX-gradient operations on the local device work driver.
src/xc_integrator/local_work_driver/device/local_device_work_driver.cxx Forwards EXX-gradient calls to the device-driver PIMPL.
src/xc_integrator/local_work_driver/device/local_device_work_driver_pimpl.hpp Adds pure-virtual EXX-gradient hooks for device-driver implementations.
src/xc_integrator/local_work_driver/device/cuda/kernels/cublas_extensions.cu Adds CUDA kernels for matrix row/column reductions used in EXX gradient assembly.
src/xc_integrator/local_work_driver/device/common/device_blas.hpp Declares matrix_reduce_rows/cols device BLAS helpers.
src/runtime_environment/device/device_runtime_environment.cxx Adds DeviceRuntimeEnvironment constructor taking an explicit byte count.
src/runtime_environment/device/device_runtime_environment_impl.hpp Implements explicit-size device-buffer allocation constructor.
include/gauxc/xc_integrator/xc_integrator_impl.hpp Adds eval_exx_grad to the virtual integrator interface and public wrapper.
include/gauxc/xc_integrator/replicated/replicated_xc_integrator_impl.hpp Adds low-level replicated EXX-gradient virtual + public entrypoint.
include/gauxc/xc_integrator/replicated/impl.hpp Implements ReplicatedXCIntegrator::eval_exx_grad_ wrapper returning a vector.
include/gauxc/xc_integrator/replicated_xc_integrator.hpp Plumbs eval_exx_grad_ through replicated integrator type.
include/gauxc/xc_integrator/impl.hpp Adds XCIntegrator::eval_exx_grad public wrapper.
include/gauxc/xc_integrator.hpp Adds public eval_exx_grad API type and method declaration.
include/gauxc/runtime_environment/decl.hpp Declares the new explicit-size DeviceRuntimeEnvironment constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Reduce results in device memory
this->timer_.time_op("XCIntegrator.Allreduce_EXX_GRAD", [&](){
this->reduction_driver_->allreduce_inplace(
device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum,
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the device-reduction path, the allreduce length is incorrect: exx_grad_device_data() points to a 3*nbf buffer (x/y/z), but the call reduces only nbf elements. This will leave 2/3 of the EXX gradient unreduced across ranks and can produce wrong gradients. Update the reduction size to 3*nbf (and ensure the reduction driver expects that count).

Suggested change
device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum,
device_data_ptr->exx_grad_device_data(), 3*nbf, ReductionOp::Sum,

Copilot uses AI. Check for mistakes.
Comment on lines +110 to +125
// Sum gradient contribution of basis functions
// to nuclei
rt.device_backend()->master_queue_synchronize();
auto& basis_map = this->load_balancer_->basis_map();
for (size_t i = 0; i< basis.nshells(); i++) {
const auto [b0, b1] = basis_map.shell_to_ao_range()[i];
const auto iCenter = basis_map.shell_to_center()[i];
for (size_t bf = b0; bf < b1; bf++) {
// factor 2 is for bra-ket integral symmetry
EXX_GRAD[3*iCenter ] += 2*exx_bfgrad[bf ];
EXX_GRAD[3*iCenter+1] += 2*exx_bfgrad[bf+1*nbf];
EXX_GRAD[3*iCenter+2] += 2*exx_bfgrad[bf+2*nbf];
}

}
}
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eval_exx_grad_ accumulates into the caller-provided EXX_GRAD buffer using += without first zeroing/initializing it. This makes the result dependent on the initial contents of EXX_GRAD and can cause incorrect gradients if the buffer is reused. Consider explicitly zeroing EXX_GRAD[0..3*natoms) at the start of the routine (or switching to assignment) to match the typical output-parameter contract.

Copilot uses AI. Check for mistakes.
Comment on lines +608 to +626
if( integrate_exc_grad ) {
if( rks ) {
EXX_GRAD = integrator.eval_exx_grad( P, sn_link_settings );
}
else if( uks ) {
std::cout << "Warning: eval_exx_grad + UKS NYI!" << std::endl;
}
else if( gks ) {
std::cout << "Warning: eval_exx_grad + GKS NYI!" << std::endl;
}
if(!world_rank) {
std::cout << "EXX Gradient:" << std::endl;
std::cout << std::scientific << std::setprecision(6);
for( auto iAt = 0; iAt < mol.size(); ++iAt ) {
std::cout << " "
<< std::setw(16) << EXX_GRAD[3*iAt + 0]
<< std::setw(16) << EXX_GRAD[3*iAt + 1]
<< std::setw(16) << EXX_GRAD[3*iAt + 2]
<< std::endl;
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If integrate_exc_grad is enabled but the calculation is UKS/GKS, EXX_GRAD is never populated/resized (only a warning is printed) but is still indexed in the printing loop. This can cause out-of-bounds access/crash. Guard the printout behind the same condition that actually computes EXX_GRAD (RKS), or ensure EXX_GRAD is resized/filled with zeros in the NYI branches.

Copilot uses AI. Check for mistakes.
Comment on lines +780 to +786
if(integrate_exc_grad) {
double exx_grad_ref_nrm(0.), exx_grad_calc_nrm(0.), exx_grad_diff_nrm(0.);
for( auto i = 0; i < 3*mol.size(); ++i ) {
const auto ref_val = EXX_GRAD_ref[i];
const auto clc_val = EXX_GRAD[i];
const auto dif_val = std::abs(ref_val - clc_val);
exx_grad_ref_nrm += ref_val*ref_val;
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EXX gradient norm comparison assumes EXX_GRAD has length 3*mol.size(). If EXX gradient is NYI for the chosen KS scheme (currently UKS/GKS), EXX_GRAD may be empty and this loop will read past the end. Only compute these norms when EXX_GRAD is available (or fill it with zeros of the correct size in the NYI paths).

Copilot uses AI. Check for mistakes.
Comment on lines +877 to +881
if( integrate_exc_grad ) {
HighFive::DataSpace grad_space( mol.size(), 3 );
dset = file.createDataSet<double>( "/EXX_GRAD", grad_space );
dset.write_raw( EXX_GRAD.data() );
}
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing the output HDF5 file, /EXX_GRAD is written whenever integrate_exx && integrate_exc_grad are true, but EXX_GRAD may be unset/empty in NYI cases (UKS/GKS). This can write invalid data or crash. Only create/write /EXX_GRAD when EXX_GRAD has been computed (and has the expected size).

Copilot uses AI. Check for mistakes.
Comment on lines +279 to +283
allocate_static_data_exx(nbf, nshells, nshell_pairs, nprim_pair_total, max_l);

if( allocated_terms.exx_grad )
GAUXC_GENERIC_EXCEPTION("Attempting to reallocate Stack EXC GRAD");

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reallocation guard in allocate_static_data_exx_grad throws "Attempting to reallocate Stack EXC GRAD", but this function is allocating EXX-gradient-related storage. The message is misleading for debugging; update it to reference EXX_GRAD (or EXX grad) instead.

Copilot uses AI. Check for mistakes.
Comment on lines 59 to 64
double* exx_k_device = nullptr; ///< EXX K storage (nbf,nbf)
double* exx_kx_device = nullptr; ///< EXX dK/dx intermediates storage (nbf,nbf)
double* exx_ky_device = nullptr; ///< EXX dK/dy intermediates storage (nbf,nbf)
double* exx_kz_device = nullptr; ///< EXX dK/dz intermediates storage (nbf,nbf)
double* exx_bfgrad_device = nullptr; ///< EXX gradient storage (nbf)
double* acc_scr_device = nullptr; ///< Accumulaion scratch (1)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for exx_bfgrad_device says "EXX gradient storage (nbf)", but the allocation/usage is 3*nbf (x/y/z components). Update the comment to avoid confusion about the buffer size/layout.

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +23
int64_t ldp, value_type* EXC_GRAD,
const IntegratorSettingsEXX& settings ) {

GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" );
util::unused(m,n,P,ldp,EXC_GRAD);
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This NYI stub is for EXX gradient, but the parameter name (EXC_GRAD) and exception message ("exc_grad") refer to EXC gradient instead. Also settings is unused but not included in util::unused, which can trigger -Wunused-parameter warnings under stricter builds. Rename to EXX_GRAD, update the exception string, and include settings in util::unused.

Suggested change
int64_t ldp, value_type* EXC_GRAD,
const IntegratorSettingsEXX& settings ) {
GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" );
util::unused(m,n,P,ldp,EXC_GRAD);
int64_t ldp, value_type* EXX_GRAD,
const IntegratorSettingsEXX& settings ) {
GAUXC_GENERIC_EXCEPTION("HostReplicated exx_grad NYI" );
util::unused(m,n,P,ldp,EXX_GRAD,settings);

Copilot uses AI. Check for mistakes.
Comment on lines 148 to +165
/** Integrate Exact Exchange for RHF
*
* @param[in] P The alpha density matrix
* @returns Excact Exchange Matrix
*/
exx_type eval_exx( const MatrixType& P, const IntegratorSettingsEXX& settings ) {
return eval_exx_(P,settings);
}

/** Integrate Exact Exchange nuclear
* derivatives for RHF
*
* @param[in] P The alpha density matrix
* @returns Excact Exchange Matrix
*/
exx_grad_type eval_exx_grad( const MatrixType& P, const IntegratorSettingsEXX& settings ) {
return eval_exx_grad_(P,settings);
}
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new eval_exx_grad Doxygen block has typos and an incorrect return description (it says "Excact Exchange Matrix" but the function returns an EXX gradient vector). Please fix the spelling and update the @returns text to describe the gradient.

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +62
explicit DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(MPI_Comm c,)
size_t nbytes) :
DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(c,) nullptr, 0) {

std::tie( device_memory_, device_memory_size_ ) =
device_backend_->allocate_device_buffer(nbytes);

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeviceRuntimeEnvironmentImpl passes a size_t nbytes to DeviceBackend::allocate_device_buffer(int64_t). On platforms where size_t is wider than int64_t, large values can overflow/truncate silently. Consider validating nbytes <= std::numeric_limits<int64_t>::max() (and/or changing the backend API to take size_t) before calling into the backend.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add seminumerical exact exchange (EXX) gradient for CUDA

2 participants