Improve CUDA resource management for MPI jobs #185
vmitq wants to merge 1 commit into wavefunction91:master
Conversation
Split memory evenly between processes on one GPU
Pull request overview
This PR makes device-backend initialization MPI-aware (when built with MPI), enabling per-node GPU assignment in a round-robin manner and adjusting the reported available GPU memory for multi-process-per-GPU runs.
Changes:
- Extend `make_device_backend` to accept an MPI communicator (when MPI is enabled) and propagate it from the device runtime environment.
- Add an MPI-aware `CUDABackend` constructor that selects a GPU based on local shared-memory rank.
- Adjust `CUDABackend::get_available_mem()` to scale down available memory based on local process/device sharing.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/runtime_environment/device/hip/hip_backend.cxx` | Updates backend factory signature to accept an MPI communicator (currently unused). |
| `src/runtime_environment/device/device_runtime_environment_impl.hpp` | Passes the runtime communicator into `make_device_backend` (MPI builds). |
| `src/runtime_environment/device/device_backend.hpp` | Updates factory declaration to accept an MPI communicator; adds include to access MPI types/macros. |
| `src/runtime_environment/device/cuda/cuda_backend.hpp` | Adds MPI-related state and an MPI-aware constructor to `CUDABackend`. |
| `src/runtime_environment/device/cuda/cuda_backend.cxx` | Implements MPI-aware CUDA init (local rank → GPU) and memory-splitting logic. |
```cpp
comm = c;
MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &local_comm);
MPI_Comm_size(local_comm, &local_size);
MPI_Comm_rank(local_comm, &local_rank);
```
`MPI_Comm_split_type` creates `local_comm`, but the communicator is never freed. Since `CUDABackend::~CUDABackend()` is defaulted, this will leak MPI communicators over the lifetime of the process (and can become problematic if backends are created/destroyed multiple times). Consider calling `MPI_Comm_free(&local_comm)` in the destructor when `local_comm != MPI_COMM_NULL` (and similarly guard/free any other duplicated/split comms you introduce).
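A minimal sketch of the guarded-free pattern this suggests, using stand-in names (`Comm`, `COMM_NULL`, `comm_free`, `BackendSketch`) in place of the real MPI types so the example stands alone:

```cpp
// Stand-ins for MPI_Comm / MPI_COMM_NULL / MPI_Comm_free, purely illustrative.
using Comm = int;
constexpr Comm COMM_NULL = -1;
static int n_freed = 0;  // counts frees so the pattern is observable
void comm_free(Comm* c) { ++n_freed; *c = COMM_NULL; }

// Mirrors the suggested ~CUDABackend(): free only if a comm was actually split.
struct BackendSketch {
  Comm local_comm = COMM_NULL;
  ~BackendSketch() {
    if (local_comm != COMM_NULL)
      comm_free(&local_comm);  // analogous to MPI_Comm_free(&local_comm)
  }
};
```

In the real backend the guard matters because a default-constructed (non-MPI) backend never creates `local_comm`, so an unconditional free would be invalid.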
```cpp
int ndev;
auto stat = cudaGetDeviceCount(&ndev);
GAUXC_CUDA_ERROR("CUDA backend init failed", stat);
gpuid = local_rank % ndev;
cudaSetDevice(gpuid);
```
`gpuid = local_rank % ndev` will invoke undefined behavior if `ndev` is 0. Even if `cudaGetDeviceCount` typically errors when no devices are present, it's safer to explicitly handle `ndev <= 0` (raise a clear exception) before the modulo. Also, `cudaSetDevice(gpuid)` returns a status that should be checked/handled via `GAUXC_CUDA_ERROR` for consistent error reporting.
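A minimal sketch of the guarded selection (the helper name `pick_gpu` and the exception text are illustrative, not from the PR):

```cpp
#include <stdexcept>

// Guard against ndev <= 0 before the modulo; round-robin over local ranks otherwise.
int pick_gpu(int local_rank, int ndev) {
  if (ndev <= 0)
    throw std::runtime_error("CUDA backend init failed: no visible CUDA devices");
  return local_rank % ndev;
}
```

The status from `cudaSetDevice(pick_gpu(local_rank, ndev))` would then be routed through `GAUXC_CUDA_ERROR` like the other calls.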
```cpp
int ndev;
stat = cudaGetDeviceCount(&ndev);
GAUXC_CUDA_ERROR("MemInfo Failed while getting number of devices", stat);
double factor = 1.0 / ((local_size - 1) / ndev + 1);
factor = (factor > 1.0 ? 1.0 : factor);
cuda_avail = size_t(cuda_avail * factor);
```
The memory-splitting logic uses `ceil(local_size / ndev)` (via `((local_size - 1) / ndev + 1)`) for every rank, which does not equal "processes sharing the same GPU" under round-robin assignment when `local_size % ndev != 0` (some GPUs have fewer ranks). This contradicts the PR description and can under-allocate usable memory on lightly loaded GPUs. To match the stated behavior, compute the exact number of local ranks mapped to this `gpuid` (e.g., `MPI_Comm_split(local_comm, gpuid, local_rank, &gpu_comm)` and use `MPI_Comm_size(gpu_comm, &n_on_gpu)`) and divide by that.
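For the specific map `gpuid = local_rank % ndev`, the exact sharer count also has a closed form, which would avoid an extra communicator split; a sketch (helper name `ranks_on_gpu` is illustrative, and this only holds if the assignment is exactly the round-robin modulo above):

```cpp
// Under gpuid = local_rank % ndev, every GPU hosts local_size / ndev ranks,
// and GPUs with id < local_size % ndev host one extra.
int ranks_on_gpu(int local_size, int ndev, int gpuid) {
  return local_size / ndev + (gpuid < local_size % ndev ? 1 : 0);
}
```

With `local_size = 5` and `ndev = 4`, GPU 0 hosts 2 ranks while GPUs 1-3 host 1 each; the ceil formula in the diff would report 2 for all four GPUs, halving the reported memory on the three lightly loaded ones.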
```cpp
double factor = 1.0 / ((local_size - 1) / ndev + 1);
factor = (factor > 1.0 ? 1.0 : factor);
cuda_avail = size_t(cuda_avail * factor);
MPI_Barrier(local_comm);
```
`MPI_Barrier(local_comm)` inside `get_available_mem()` introduces a mandatory synchronization in what is otherwise a local query. This can become a scalability bottleneck and can also deadlock if some ranks don't call `get_available_mem()` on the same code path (or if `local_comm` is `MPI_COMM_NULL`). Consider removing the barrier, or moving any required synchronization to a higher-level initialization phase where collective behavior is guaranteed.
Suggested change:
```diff
-  MPI_Barrier(local_comm);
```
```cpp
#include "device_queue.hpp"
#include "device_blas_handle.hpp"
#include <gauxc/gauxc_config.hpp>
#include "gauxc/runtime_environment.hpp"
```
Including the full public `gauxc/runtime_environment.hpp` here appears unnecessary for declaring `make_device_backend(GAUXC_MPI_CODE(MPI_Comm))` and adds extra coupling/compile-time cost. A narrower include like `<gauxc/util/mpi.hpp>` (for `GAUXC_MPI_CODE` and `MPI_Comm`) would avoid pulling in the runtime-environment API from this low-level backend header.
Suggested change:
```diff
-#include "gauxc/runtime_environment.hpp"
+#include <gauxc/util/mpi.hpp>
```
```diff
-std::unique_ptr<DeviceBackend> make_device_backend() {
+std::unique_ptr<DeviceBackend> make_device_backend(GAUXC_MPI_CODE(MPI_Comm c))
+{
```
`make_device_backend(GAUXC_MPI_CODE(MPI_Comm c))` ignores `c` for the HIP backend. In MPI-enabled builds this can trigger `-Wunused-parameter` (often treated as an error in CI). Consider marking the parameter as unused (e.g., `(void)c;` under `#ifdef GAUXC_HAS_MPI`, or `[[maybe_unused]]`) or only adding the parameter on backends that actually use it.
Suggested change:
```diff
 {
+#ifdef GAUXC_HAS_MPI
+  (void) c;
+#endif
```
The code detects the number of local MPI processes and available CUDA devices, then assigns a GPU ID to each process in round-robin fashion. When determining available memory, it is divided evenly among the processes sharing the same GPU. This simplifies GPU resource management when running jobs with multiple GPUs per host or multiple processes per GPU.