Skip to content

Fix CUDA Graph Capture/Sync race#1250

Closed
akifcorduk wants to merge 5 commits into
NVIDIA:mainfrom
akifcorduk:test_main_fix
Closed

Fix CUDA Graph Capture/Sync race#1250
akifcorduk wants to merge 5 commits into
NVIDIA:mainfrom
akifcorduk:test_main_fix

Conversation

@akifcorduk
Copy link
Copy Markdown
Contributor

run_concurrent in cpp/src/pdlp/solve.cu constructed the barrier raft::handle_t inside the spawned barrier std::thread, concurrently with PDLP's graph capture on the main thread. First-use init of cuBLAS / cuSPARSE / cuSolverDn / cuDSS triggers synchronous cudaMalloc / cudaDeviceSynchronize under the hood, which invalidates the in-flight capture.

We now construct barrier_handle_ptr on the main thread. Init cudssCreate+cudssDestroy on the main thread to force the CUDA driver's lazy cuDSS module load to settle outside the capture window.
std::promise handshake so run_pdlp only starts after the barrier worker is scheduled. Promise is set unconditionally (including in the catch block) so the main thread cannot deadlock.

Closes #1185

@akifcorduk akifcorduk added this to the 26.06 milestone May 20, 2026
@akifcorduk akifcorduk requested a review from a team as a code owner May 20, 2026 00:16
@akifcorduk akifcorduk added the bug Something isn't working label May 20, 2026
@akifcorduk akifcorduk requested review from aliceb-nv and mlubin May 20, 2026 00:16
@akifcorduk akifcorduk added the non-breaking Introduces a non-breaking change label May 20, 2026
@akifcorduk akifcorduk requested a review from a team as a code owner May 20, 2026 00:18
@akifcorduk akifcorduk requested review from hlinsen, rg20 and tmckayus and removed request for aliceb-nv May 20, 2026 00:18
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9ac7957e-86ff-451a-9284-8e9e94803524

📥 Commits

Reviewing files that changed from the base of the PR and between 7ba6dab and ba92445.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

📝 Walkthrough

Walkthrough

Barrier-thread initialization in the concurrent PDLP solver is reordered: the barrier-side raft::handle_t is created on the main thread, cuDSS warmup is run, and a promise/future readiness handshake is added so the main thread waits before starting PDLP CUDA graph capture. Separately, swath1 incumbent-callback test cases are no longer skipped.

Changes

CUDA Graph Capture Synchronization

Layer / File(s) Summary
Barrier thread synchronization and cuDSS warmup
cpp/src/pdlp/solve.cu
Adds <atomic> and <future> headers; constructs barrier-side raft::handle_t on the main thread; performs cuDSS warmup (including per-device warmup when num_gpus > 1); introduces a std::promise/std::future + std::atomic readiness signal set by the barrier thread after establishing device context or in its exception path.
Main thread PDLP capture gate
cpp/src/pdlp/solve.cu
Main thread waits on barrier_ready_f.wait() before calling run_pdlp, ensuring CUDA graph capture begins only after the barrier thread signals readiness.

Incumbent callback tests

Layer / File(s) Summary
Enable swath1 incumbent callback tests
python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py, cpp/tests/mip/incumbent_callback_test.cu
Remove special-case pytest.param(..., marks=_SWATH1_GRAPH_CAPTURE_SKIP) for "/mip/swath1.mps" and add mip/swath1.mps to the C++ test instance lists so both Python and C++ incumbent-callback tests include the swath1 dataset.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/cuopt#1208: Related edits to incumbent-callback test handling for /mip/swath1.mps.
  • NVIDIA/cuopt#1206: Also modifies run_concurrent worker/thread coordination and exception/cleanup handling.

Suggested reviewers

  • mlubin
  • hlinsen
  • rg20
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main issue being fixed: a race condition between CUDA graph capture and synchronization in the concurrent solver barrier initialization.
Description check ✅ Passed The description provides clear context on the root cause (concurrent handle construction during graph capture) and specific remedies (main-thread construction, cuDSS warm-up, promise handshake).
Linked Issues check ✅ Passed The PR fully addresses issue #1185 by fixing the CUDA graph capture race condition in solve.cu and re-enabling the previously skipped swath1 regression tests in both C++ and Python test files.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the CUDA graph capture race condition (solve.cu) and re-enabling related regression tests (test files), with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/pdlp/solve.cu (1)

1545-1563: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard the pre-capture setup with thread cleanup.

dual_simplex_thread can already be running by the time barrier_handle_ptr construction and warmup_cudss() execute. If any of that setup throws, unwinding will destroy a joinable std::thread and hard-abort via std::terminate() before the error can propagate cleanly. Please either defer spawning until all throwing setup is done, or add an RAII joiner/scope guard for both worker threads. As per coding guidelines, "Flag missing RAII in exception paths since cuOpt uses exceptions" and "Flag deadlock potential in resource acquisition patterns".

Also applies to: 1616-1630

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 1545 - 1563, The code spawns
dual_simplex_thread before later setup (barrier_handle_ptr construction and
warmup_cudss()) that may throw, risking std::terminate on unwind; fix by moving
thread creation until after all potentially-throwing setup completes or by
adding an RAII thread-join guard to ensure any joinable std::thread is joined in
destructors. Concretely: either defer launching dual_simplex_thread (and the
analogous other worker thread) until after barrier_handle_ptr is constructed and
warmup_cudss() returns, or introduce a small scope guard/type (e.g.,
ThreadJoinGuard) that takes references to dual_simplex_thread (and the other
thread) and joins in its destructor so exceptions during subsequent setup won’t
abort; ensure run_dual_simplex_thread and dual_simplex_thread usage remains
unchanged aside from creation timing or guarding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1545-1563: The code spawns dual_simplex_thread before later setup
(barrier_handle_ptr construction and warmup_cudss()) that may throw, risking
std::terminate on unwind; fix by moving thread creation until after all
potentially-throwing setup completes or by adding an RAII thread-join guard to
ensure any joinable std::thread is joined in destructors. Concretely: either
defer launching dual_simplex_thread (and the analogous other worker thread)
until after barrier_handle_ptr is constructed and warmup_cudss() returns, or
introduce a small scope guard/type (e.g., ThreadJoinGuard) that takes references
to dual_simplex_thread (and the other thread) and joins in its destructor so
exceptions during subsequent setup won’t abort; ensure run_dual_simplex_thread
and dual_simplex_thread usage remains unchanged aside from creation timing or
guarding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 79004054-042e-47a2-9469-038774f1a555

📥 Commits

Reviewing files that changed from the base of the PR and between 415b12d and ecef670.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

@akifcorduk akifcorduk added the P0 label May 20, 2026
@mlubin
Copy link
Copy Markdown
Contributor

mlubin commented May 20, 2026

Please also re-enable the C++ tests in cpp/tests/mip/incumbent_callback_test.cu (search "swath1 is temporarily disabled").

Comment thread cpp/src/pdlp/solve.cu
raft::device_setter device_setter(1);
CUOPT_LOG_DEBUG("Barrier device: %d", device_setter.get_current_device());
barrier_handle_ptr = std::make_unique<raft::handle_t>();
warmup_cudss();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need to explicitly call get_cublas_handle etc. Raft does lazy init https://github.com/rapidsai/raft/blob/main/cpp/include/raft/core/resource/cublas_handle.hpp#L62

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good catch Hugo! I wonder how many of those lazy inits we have. I think this will get obfuscated if we need to call check the lazy inits of every library call. I am working on a permanent fix, if that works I will close this PR and we don't need to do this warm start for each library resource.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying this: #1253 if this works, we don't need this PR and it is a permanent fix.

Copy link
Copy Markdown
Contributor

@Kh4ster Kh4ster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Akif!

@akifcorduk
Copy link
Copy Markdown
Contributor Author

Closing the PR because of the permanent fix another PR: #1253

@akifcorduk akifcorduk closed this May 20, 2026
rapids-bot Bot pushed a commit that referenced this pull request May 21, 2026
This is a permanent fix to cuda graph capture issue. We add a small RAII wrapper around `cudaStreamBeginCapture` / `cudaStreamEndCapture` that detects `cudaErrorStreamCaptureInvalidated` at EndCapture time, drops the (never-issued) partial graph, re-runs the callable eagerly so the current iteration still produces correct results, and stays uninitialized so the next call retries capture. One extra eager pass instead of a crash.

Closing the other PR:#1250
Fixes #1185

Authors:
  - Akif ÇÖRDÜK (https://github.com/akifcorduk)
  - Nicolas Blin (https://github.com/Kh4ster)

Approvers:
  - Hugo Linsenmaier (https://github.com/hlinsen)
  - Ramakrishnap (https://github.com/rgsl888prabhu)

URL: #1253
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change P0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Crash on test_incumbent_callbacks (python)

4 participants