Enable seminumerical exact exchange calculation with CUDA#183
Enable seminumerical exact exchange calculation with CUDA#183vmitq wants to merge 1 commit intowavefunction91:masterfrom
Conversation
Co-authored-by: Lukas Gergs <lg@terraquantum.swiss>
There was a problem hiding this comment.
Pull request overview
This PR adds support for computing the seminumerical exact-exchange (EXX) contribution to the nuclear gradient on CUDA device backends, and exposes it through the public XCIntegrator / replicated-integrator APIs.
Changes:
- Add
eval_exx_gradAPI plumbing acrossXCIntegratorand replicated integrator layers. - Add device storage + local-work driver support for EXX-gradient intermediates and reduction.
- Extend the standalone driver to load/write/print/compare
EXX_GRADfrom HDF5 references.
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/standalone_driver.cxx | Loads/writes/prints EXX_GRAD reference data and compares norms. |
| src/xc_integrator/xc_data/device/xc_device_stack_data.hpp | Adds device pointers and interface hooks for EXX-gradient buffers. |
| src/xc_integrator/xc_data/device/xc_device_stack_data.cxx | Allocates/zeros/retrieves EXX-gradient intermediates on device. |
| src/xc_integrator/xc_data/device/xc_device_data.hpp | Adds exx_grad term tracking and device-data virtual interface for EXX gradient. |
| src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator.hpp | Declares eval_exx_grad_ in shell-batched replicated integrator. |
| src/xc_integrator/shell_batched/shell_batched_replicated_xc_integrator_exx_grad.hpp | Adds NYI stub for shell-batched EXX gradient. |
| src/xc_integrator/replicated/replicated_xc_integrator_impl.cxx | Adds replicated pimpl forwarding function eval_exx_grad. |
| src/xc_integrator/replicated/host/shell_batched_replicated_xc_host_integrator.cxx | Wires in the shell-batched EXX-gradient stub header. |
| src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.hpp | Declares eval_exx_grad_ for reference host integrator. |
| src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator.cxx | Wires in the reference EXX-gradient stub header. |
| src/xc_integrator/replicated/host/reference_replicated_xc_host_integrator_exx_grad.hpp | Adds NYI stub for reference host EXX gradient. |
| src/xc_integrator/replicated/device/shell_batched_replicated_xc_device_integrator.cxx | Wires in the shell-batched EXX-gradient stub header on device build. |
| src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.hpp | Declares EXX-gradient evaluation and local-work helpers. |
| src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator.cxx | Includes the new EXX-gradient device implementation header. |
| src/xc_integrator/replicated/device/incore_replicated_xc_device_integrator_exx_grad.hpp | Implements EXX gradient evaluation workflow on device. |
| src/xc_integrator/local_work_driver/device/scheme1_magma_base.hpp | Adds EXX-gradient driver API surface for MAGMA scheme (NYI). |
| src/xc_integrator/local_work_driver/device/scheme1_magma_base.cxx | Adds NYI throws for EXX-gradient operations under MAGMA. |
| src/xc_integrator/local_work_driver/device/scheme1_base.hpp | Extends scheme1 base interface with EXX-gradient ops. |
| src/xc_integrator/local_work_driver/device/scheme1_base.cxx | Implements EXX K-derivative accumulation and contraction to basis-function gradients. |
| src/xc_integrator/local_work_driver/device/local_device_work_driver.hpp | Exposes EXX-gradient operations on the local device work driver. |
| src/xc_integrator/local_work_driver/device/local_device_work_driver.cxx | Forwards EXX-gradient calls to the device-driver PIMPL. |
| src/xc_integrator/local_work_driver/device/local_device_work_driver_pimpl.hpp | Adds pure-virtual EXX-gradient hooks for device-driver implementations. |
| src/xc_integrator/local_work_driver/device/cuda/kernels/cublas_extensions.cu | Adds CUDA kernels for matrix row/column reductions used in EXX gradient assembly. |
| src/xc_integrator/local_work_driver/device/common/device_blas.hpp | Declares matrix_reduce_rows/cols device BLAS helpers. |
| src/runtime_environment/device/device_runtime_environment.cxx | Adds DeviceRuntimeEnvironment constructor taking an explicit byte count. |
| src/runtime_environment/device/device_runtime_environment_impl.hpp | Implements explicit-size device-buffer allocation constructor. |
| include/gauxc/xc_integrator/xc_integrator_impl.hpp | Adds eval_exx_grad to the virtual integrator interface and public wrapper. |
| include/gauxc/xc_integrator/replicated/replicated_xc_integrator_impl.hpp | Adds low-level replicated EXX-gradient virtual + public entrypoint. |
| include/gauxc/xc_integrator/replicated/impl.hpp | Implements ReplicatedXCIntegrator::eval_exx_grad_ wrapper returning a vector. |
| include/gauxc/xc_integrator/replicated_xc_integrator.hpp | Plumbs eval_exx_grad_ through replicated integrator type. |
| include/gauxc/xc_integrator/impl.hpp | Adds XCIntegrator::eval_exx_grad public wrapper. |
| include/gauxc/xc_integrator.hpp | Adds public eval_exx_grad API type and method declaration. |
| include/gauxc/runtime_environment/decl.hpp | Declares the new explicit-size DeviceRuntimeEnvironment constructor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Reduce results in device memory | ||
| this->timer_.time_op("XCIntegrator.Allreduce_EXX_GRAD", [&](){ | ||
| this->reduction_driver_->allreduce_inplace( | ||
| device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum, |
There was a problem hiding this comment.
In the device-reduction path, the allreduce length is incorrect: exx_grad_device_data() points to a 3*nbf buffer (x/y/z), but the call reduces only nbf elements. This will leave 2/3 of the EXX gradient unreduced across ranks and can produce wrong gradients. Update the reduction size to 3*nbf (and ensure the reduction driver expects that count).
| device_data_ptr->exx_grad_device_data(), nbf, ReductionOp::Sum, | |
| device_data_ptr->exx_grad_device_data(), 3*nbf, ReductionOp::Sum, |
| // Sum gradient contribution of basis functions | ||
| // to nuclei | ||
| rt.device_backend()->master_queue_synchronize(); | ||
| auto& basis_map = this->load_balancer_->basis_map(); | ||
| for (size_t i = 0; i< basis.nshells(); i++) { | ||
| const auto [b0, b1] = basis_map.shell_to_ao_range()[i]; | ||
| const auto iCenter = basis_map.shell_to_center()[i]; | ||
| for (size_t bf = b0; bf < b1; bf++) { | ||
| // factor 2 is for bra-ket integral symmetry | ||
| EXX_GRAD[3*iCenter ] += 2*exx_bfgrad[bf ]; | ||
| EXX_GRAD[3*iCenter+1] += 2*exx_bfgrad[bf+1*nbf]; | ||
| EXX_GRAD[3*iCenter+2] += 2*exx_bfgrad[bf+2*nbf]; | ||
| } | ||
|
|
||
| } | ||
| } |
There was a problem hiding this comment.
eval_exx_grad_ accumulates into the caller-provided EXX_GRAD buffer using += without first zeroing/initializing it. This makes the result dependent on the initial contents of EXX_GRAD and can cause incorrect gradients if the buffer is reused. Consider explicitly zeroing EXX_GRAD[0..3*natoms) at the start of the routine (or switching to assignment) to match the typical output-parameter contract.
| if( integrate_exc_grad ) { | ||
| if( rks ) { | ||
| EXX_GRAD = integrator.eval_exx_grad( P, sn_link_settings ); | ||
| } | ||
| else if( uks ) { | ||
| std::cout << "Warning: eval_exx_grad + UKS NYI!" << std::endl; | ||
| } | ||
| else if( gks ) { | ||
| std::cout << "Warning: eval_exx_grad + GKS NYI!" << std::endl; | ||
| } | ||
| if(!world_rank) { | ||
| std::cout << "EXX Gradient:" << std::endl; | ||
| std::cout << std::scientific << std::setprecision(6); | ||
| for( auto iAt = 0; iAt < mol.size(); ++iAt ) { | ||
| std::cout << " " | ||
| << std::setw(16) << EXX_GRAD[3*iAt + 0] | ||
| << std::setw(16) << EXX_GRAD[3*iAt + 1] | ||
| << std::setw(16) << EXX_GRAD[3*iAt + 2] | ||
| << std::endl; |
There was a problem hiding this comment.
If integrate_exc_grad is enabled but the calculation is UKS/GKS, EXX_GRAD is never populated/resized (only a warning is printed) but is still indexed in the printing loop. This can cause out-of-bounds access/crash. Guard the printout behind the same condition that actually computes EXX_GRAD (RKS), or ensure EXX_GRAD is resized/filled with zeros in the NYI branches.
| if(integrate_exc_grad) { | ||
| double exx_grad_ref_nrm(0.), exx_grad_calc_nrm(0.), exx_grad_diff_nrm(0.); | ||
| for( auto i = 0; i < 3*mol.size(); ++i ) { | ||
| const auto ref_val = EXX_GRAD_ref[i]; | ||
| const auto clc_val = EXX_GRAD[i]; | ||
| const auto dif_val = std::abs(ref_val - clc_val); | ||
| exx_grad_ref_nrm += ref_val*ref_val; |
There was a problem hiding this comment.
The EXX gradient norm comparison assumes EXX_GRAD has length 3*mol.size(). If EXX gradient is NYI for the chosen KS scheme (currently UKS/GKS), EXX_GRAD may be empty and this loop will read past the end. Only compute these norms when EXX_GRAD is available (or fill it with zeros of the correct size in the NYI paths).
| if( integrate_exc_grad ) { | ||
| HighFive::DataSpace grad_space( mol.size(), 3 ); | ||
| dset = file.createDataSet<double>( "/EXX_GRAD", grad_space ); | ||
| dset.write_raw( EXX_GRAD.data() ); | ||
| } |
There was a problem hiding this comment.
When writing the output HDF5 file, /EXX_GRAD is written whenever integrate_exx && integrate_exc_grad are true, but EXX_GRAD may be unset/empty in NYI cases (UKS/GKS). This can write invalid data or crash. Only create/write /EXX_GRAD when EXX_GRAD has been computed (and has the expected size).
| allocate_static_data_exx(nbf, nshells, nshell_pairs, nprim_pair_total, max_l); | ||
|
|
||
| if( allocated_terms.exx_grad ) | ||
| GAUXC_GENERIC_EXCEPTION("Attempting to reallocate Stack EXC GRAD"); | ||
|
|
There was a problem hiding this comment.
The reallocation guard in allocate_static_data_exx_grad throws "Attempting to reallocate Stack EXC GRAD", but this function is allocating EXX-gradient-related storage. The message is misleading for debugging; update it to reference EXX_GRAD (or EXX grad) instead.
| double* exx_k_device = nullptr; ///< EXX K storage (nbf,nbf) | ||
| double* exx_kx_device = nullptr; ///< EXX dK/dx intermediates storage (nbf,nbf) | ||
| double* exx_ky_device = nullptr; ///< EXX dK/dy intermediates storage (nbf,nbf) | ||
| double* exx_kz_device = nullptr; ///< EXX dK/dz intermediates storage (nbf,nbf) | ||
| double* exx_bfgrad_device = nullptr; ///< EXX gradient storage (nbf) | ||
| double* acc_scr_device = nullptr; ///< Accumulaion scratch (1) |
There was a problem hiding this comment.
The comment for exx_bfgrad_device says "EXX gradient storage (nbf)", but the allocation/usage is 3*nbf (x/y/z components). Update the comment to avoid confusion about the buffer size/layout.
| int64_t ldp, value_type* EXC_GRAD, | ||
| const IntegratorSettingsEXX& settings ) { | ||
|
|
||
| GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" ); | ||
| util::unused(m,n,P,ldp,EXC_GRAD); |
There was a problem hiding this comment.
This NYI stub is for EXX gradient, but the parameter name (EXC_GRAD) and exception message ("exc_grad") refer to EXC gradient instead. Also settings is unused but not included in util::unused, which can trigger -Wunused-parameter warnings under stricter builds. Rename to EXX_GRAD, update the exception string, and include settings in util::unused.
| int64_t ldp, value_type* EXC_GRAD, | |
| const IntegratorSettingsEXX& settings ) { | |
| GAUXC_GENERIC_EXCEPTION("HostReplicated exc_grad NYI" ); | |
| util::unused(m,n,P,ldp,EXC_GRAD); | |
| int64_t ldp, value_type* EXX_GRAD, | |
| const IntegratorSettingsEXX& settings ) { | |
| GAUXC_GENERIC_EXCEPTION("HostReplicated exx_grad NYI" ); | |
| util::unused(m,n,P,ldp,EXX_GRAD,settings); |
| /** Integrate Exact Exchange for RHF | ||
| * | ||
| * @param[in] P The alpha density matrix | ||
| * @returns Excact Exchange Matrix | ||
| */ | ||
| exx_type eval_exx( const MatrixType& P, const IntegratorSettingsEXX& settings ) { | ||
| return eval_exx_(P,settings); | ||
| } | ||
|
|
||
| /** Integrate Exact Exchange nuclear | ||
| * derivatives for RHF | ||
| * | ||
| * @param[in] P The alpha density matrix | ||
| * @returns Excact Exchange Matrix | ||
| */ | ||
| exx_grad_type eval_exx_grad( const MatrixType& P, const IntegratorSettingsEXX& settings ) { | ||
| return eval_exx_grad_(P,settings); | ||
| } |
There was a problem hiding this comment.
The new eval_exx_grad Doxygen block has typos and an incorrect return description (it says "Excact Exchange Matrix" but the function returns an EXX gradient vector). Please fix the spelling and update the @returns text to describe the gradient.
| explicit DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(MPI_Comm c,) | ||
| size_t nbytes) : | ||
| DeviceRuntimeEnvironmentImpl(GAUXC_MPI_CODE(c,) nullptr, 0) { | ||
|
|
||
| std::tie( device_memory_, device_memory_size_ ) = | ||
| device_backend_->allocate_device_buffer(nbytes); | ||
|
|
There was a problem hiding this comment.
DeviceRuntimeEnvironmentImpl passes a size_t nbytes to DeviceBackend::allocate_device_buffer(int64_t). On platforms where size_t is wider than int64_t, large values can overflow/truncate silently. Consider validating nbytes <= std::numeric_limits<int64_t>::max() (and/or changing the backend API to take size_t) before calling into the backend.
This PR implements a new feature: the calculation of the seminumerical exact exchange contribution to the gradient. Currently, the EXX gradient is supported only for CUDA+cuBLAS.