Skip to content

Cuda graph replays on capture error#1253

Merged
rapids-bot[bot] merged 10 commits into
NVIDIA:release/26.06from
akifcorduk:cuda_graph_side_capture
May 21, 2026
Merged

Cuda graph replays on capture error#1253
rapids-bot[bot] merged 10 commits into
NVIDIA:release/26.06from
akifcorduk:cuda_graph_side_capture

Conversation

@akifcorduk
Copy link
Copy Markdown
Contributor

@akifcorduk akifcorduk commented May 20, 2026

This is a permanent fix to cuda graph capture issue. We add a small RAII wrapper around cudaStreamBeginCapture / cudaStreamEndCapture that detects cudaErrorStreamCaptureInvalidated at EndCapture time, drops the (never-issued) partial graph, re-runs the callable eagerly so the current iteration still produces correct results, and stays uninitialized so the next call retries capture. One extra eager pass instead of a crash.

Closing the other PR:#1250
Fixes #1185

@akifcorduk akifcorduk added this to the 26.06 milestone May 20, 2026
@akifcorduk akifcorduk requested a review from a team as a code owner May 20, 2026 02:15
@akifcorduk akifcorduk added the bug Something isn't working label May 20, 2026
@akifcorduk akifcorduk requested a review from a team as a code owner May 20, 2026 02:15
@akifcorduk akifcorduk added the non-breaking Introduces a non-breaking change label May 20, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Add a lazy manual_cuda_graph_t wrapper and migrate manual CUDA-graph capture/instantiate/launch call sites (PDHG, feasibility jump, ping-pong graph, weighted-average, adaptive step-size) to its run(stream, work) API; re-enable previously skipped incumbent callback tests.

Changes

CUDA Graph Abstraction and Solver Integration

Layer / File(s) Summary
manual_cuda_graph_t wrapper foundation
cpp/src/utilities/manual_cuda_graph.cuh
New manual_cuda_graph_t with lazy graph capture/instantiate/launch via run(stream, work), capture-invalidation recovery, RAII capture guard, move semantics, is_initialized() and reset() methods.
ping_pong_graph modernization
cpp/src/pdlp/utilities/ping_pong_graph.cuh, cpp/src/pdlp/utilities/ping_pong_graph.cu
Refactor to two-slot cache using manual_cuda_graph_t (even_graph_, odd_graph_) and single run(total_pdlp_iterations, work) API; remove explicit lifecycle methods and associated .cu implementations.
Feasibility jump MIP heuristic CUDA graph wrapper
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cuh, cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
Replace graph_instance/graph_created with manual_cuda_graph_t step_graph_; move kernel-arg buffers and CUB probe to shared scope; wrap per-iteration kernel sequence in step_body and execute via step_graph_.run(...) or directly; reset_cuda_graph calls step_graph_.reset().
PDHG solver primal/dual step refactoring
cpp/src/pdlp/pdhg.cu, cpp/src/pdlp/pdhg.hpp, cpp/src/pdlp/pdlp.cu
Replace manual capture/instantiate/launch control with graph_all.run(...), new graph_all_non_major.run(...), and graph_prim_proj_gradient_dual.run(...); add accessor and member for non-major ping-pong cache; reset both caches after SpMM resize.
Utility components optimization
cpp/src/pdlp/restart_strategy/weighted_average_solution.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu
Replace conditional capture/instantiate/launch flows with graph.run(total_pdlp_iterations, [&]{ ... }) wrappers enclosing the same device operations.
Incumbent callback test re-enablement
cpp/tests/mip/incumbent_callback_test.cu, python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py
Add swath1.mps back into C++ test parameter lists and remove Python pytest skip marker so swath1 cases run normally.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/cuopt#1208: Both PRs modify the same incumbent callback tests to change whether the swath1.mps instance runs (main PR un-skips/adds it, retrieved PR temporarily disables/removes it), so the changes directly conflict at the test-parameter level.

Suggested reviewers

  • tmckayus
  • chris-maes

Suggested labels

pdlp

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Cuda graph replays on capture error' directly describes the main change: adding resilience to CUDA graph capture by replaying on capture errors.
Description check ✅ Passed The description is clearly related to the changeset, explaining the RAII wrapper mechanism and error handling strategy introduced in the PR.
Linked Issues check ✅ Passed The PR fully addresses issue #1185 by implementing a permanent fix for CUDA graph capture failures, enabling graceful replay on cudaErrorStreamCaptureInvalidated.
Out of Scope Changes check ✅ Passed All changes are directly related to the CUDA graph capture error handling objective: new manual_cuda_graph_t wrapper, refactored graph-execution APIs in ping_pong_graph_t and related components, and test updates to verify the fix.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/utilities/manual_cuda_graph.cuh (2)

136-145: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Double-free of CUDA graph handle when dummy == parent.

With cudaStreamBeginCaptureToGraph, the captured graph returned by cudaStreamEndCapture is the same handle as the parent graph passed in. If user code throws mid-capture and capture wasn't invalidated, dummy == parent, so line 142 destroys it, then line 144 destroys the same handle again.

🐛 Proposed fix: skip destroying dummy since it aliases parent
     ~capture_guard_t() noexcept
     {
       if (capture_active) {
         cudaGraph_t dummy = nullptr;
         // best-effort; we're already unwinding
         cudaStreamEndCapture(stream, &dummy);
-        if (dummy != nullptr) { cudaGraphDestroy(dummy); }
+        // dummy == parent for manual capture; destroying parent below handles it
       }
-      if (parent != nullptr) { cudaGraphDestroy(parent); }
+      if (parent != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(parent)); }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 136 - 145, The
destructor ~capture_guard_t() can double-destroy the same cudaGraph_t when
cudaStreamEndCapture returns the same handle as parent; modify the cleanup so
that after calling cudaStreamEndCapture(stream, &dummy) you only call
cudaGraphDestroy on dummy if dummy != parent (or alternatively only destroy
parent and skip destroying dummy when they alias), ensuring you still destroy
parent if it is non-null and avoid calling cudaGraphDestroy twice on the same
handle (refer to symbols: ~capture_guard_t, capture_active,
cudaStreamEndCapture, dummy, parent, cudaGraphDestroy).

84-85: ⚠️ Potential issue | 🔴 Critical

cudaStreamBeginCaptureToGraph requires CUDA 12.3+ but the project supports CUDA 12.0+.

This API was introduced in CUDA 12.3. The unconditional use at lines 84–85 will cause build failures on CUDA 12.0, 12.1, and 12.2. Either add a version guard with a fallback implementation or update the project's minimum CUDA requirement to 12.3.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 84 - 85, The call to
cudaStreamBeginCaptureToGraph inside manual_cuda_graph.cuh (wrapped by
RAFT_CUDA_TRY) requires CUDA 12.3+, but the project supports CUDA 12.0–12.2;
guard the call with a CUDA version check (e.g., `#if` defined(CUDA_VERSION) &&
CUDA_VERSION >= 12030) and provide a fallback for older toolkits (call the older
cudaStreamBeginCapture API or cudaStreamBeginCapture(stream.value(),
cudaStreamCaptureModeThreadLocal) within the RAFT_CUDA_TRY) so builds on CUDA
12.0–12.2 use the compatible capture API; ensure both branches use the same
error handling macro (RAFT_CUDA_TRY) and keep the unique symbol names
(cudaStreamBeginCaptureToGraph, cudaStreamBeginCapture, stream.value(),
RAFT_CUDA_TRY) to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 136-145: The destructor ~capture_guard_t() can double-destroy the
same cudaGraph_t when cudaStreamEndCapture returns the same handle as parent;
modify the cleanup so that after calling cudaStreamEndCapture(stream, &dummy)
you only call cudaGraphDestroy on dummy if dummy != parent (or alternatively
only destroy parent and skip destroying dummy when they alias), ensuring you
still destroy parent if it is non-null and avoid calling cudaGraphDestroy twice
on the same handle (refer to symbols: ~capture_guard_t, capture_active,
cudaStreamEndCapture, dummy, parent, cudaGraphDestroy).
- Around line 84-85: The call to cudaStreamBeginCaptureToGraph inside
manual_cuda_graph.cuh (wrapped by RAFT_CUDA_TRY) requires CUDA 12.3+, but the
project supports CUDA 12.0–12.2; guard the call with a CUDA version check (e.g.,
`#if` defined(CUDA_VERSION) && CUDA_VERSION >= 12030) and provide a fallback for
older toolkits (call the older cudaStreamBeginCapture API or
cudaStreamBeginCapture(stream.value(), cudaStreamCaptureModeThreadLocal) within
the RAFT_CUDA_TRY) so builds on CUDA 12.0–12.2 use the compatible capture API;
ensure both branches use the same error handling macro (RAFT_CUDA_TRY) and keep
the unique symbol names (cudaStreamBeginCaptureToGraph, cudaStreamBeginCapture,
stream.value(), RAFT_CUDA_TRY) to locate and implement the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4efb04c-5af5-4c02-baa4-c8ea9d79fa41

📥 Commits

Reviewing files that changed from the base of the PR and between 0a5149b and 06352db.

📒 Files selected for processing (1)
  • cpp/src/utilities/manual_cuda_graph.cuh

@anandhkb anandhkb added the P0 label May 20, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/utilities/manual_cuda_graph.cuh (1)

76-92: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Destroy the captured graph on instantiation failure.

After a successful cudaStreamEndCapture, captured owns a graph. If cudaGraphInstantiate fails, the throwing macro skips the cudaGraphDestroy below, so each failed first-run/retry leaks a graph handle.

♻️ Proposed fix
     cudaGraph_t captured = nullptr;
     cudaError_t end_err  = cudaStreamEndCapture(stream.value(), &captured);
     guard.capture_active = false;
@@
-    RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured));
-    RAFT_CUDA_TRY(cudaGraphDestroy(captured));
+    try {
+      RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured));
+      RAFT_CUDA_TRY(cudaGraphDestroy(captured));
+      captured = nullptr;
+    } catch (...) {
+      if (captured != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(captured)); }
+      throw;
+    }

As per coding guidelines, "Flag missing RAII in exception paths since cuOpt uses exceptions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 76 - 92, After
cudaStreamEndCapture succeeds and before calling
cudaGraphInstantiate(&instance_, captured), ensure that on instantiate failure
the captured graph is destroyed to avoid leaking the graph handle; replace or
wrap the RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured)) call so that
if cudaGraphInstantiate returns an error you call cudaGraphDestroy(captured) and
then propagate the error (or rethrow) rather than letting the macro skip the
destroy; reference symbols: captured, cudaGraphInstantiate, cudaGraphDestroy,
RAFT_CUDA_TRY, and instance_.
🧹 Nitpick comments (1)
cpp/src/utilities/manual_cuda_graph.cuh (1)

23-34: ⚡ Quick win

Document that invalidation recovery replays work() on the host.

The fallback path preserves device-side results, but any host-side mutation inside work has already happened once during the failed capture attempt and will happen again here. Please call out that work must be host-idempotent, or keep host bookkeeping outside the callable.

📝 Suggested clarification
 // Wrapper around a CUDA graph captured from a callable. CUB / Thrust / RAFT /
 // cuSPARSE calls inside the captured region are preserved.
+// `work` must not perform non-idempotent host-side mutations: if capture is
+// invalidated, it is executed once during the failed capture attempt and once
+// again in the eager fallback path.
 //
 // Invalidation recovery:

Also applies to: 80-87

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 23 - 34, Add a sentence
to the invalidation recovery doc block explaining that when cudaStreamEndCapture
returns cudaErrorStreamCaptureInvalidated the wrapper drains the sticky error
and re-executes the provided callable (work()) on the host, so any host-side
mutations inside work() will run twice; update text near the description of
cudaStreamEndCapture, work(), and run to state that work must be host-idempotent
or that host bookkeeping should be moved out of the callable to avoid double
application during recovery.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 76-92: After cudaStreamEndCapture succeeds and before calling
cudaGraphInstantiate(&instance_, captured), ensure that on instantiate failure
the captured graph is destroyed to avoid leaking the graph handle; replace or
wrap the RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured)) call so that
if cudaGraphInstantiate returns an error you call cudaGraphDestroy(captured) and
then propagate the error (or rethrow) rather than letting the macro skip the
destroy; reference symbols: captured, cudaGraphInstantiate, cudaGraphDestroy,
RAFT_CUDA_TRY, and instance_.

---

Nitpick comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 23-34: Add a sentence to the invalidation recovery doc block
explaining that when cudaStreamEndCapture returns
cudaErrorStreamCaptureInvalidated the wrapper drains the sticky error and
re-executes the provided callable (work()) on the host, so any host-side
mutations inside work() will run twice; update text near the description of
cudaStreamEndCapture, work(), and run to state that work must be host-idempotent
or that host bookkeeping should be moved out of the callable to avoid double
application during recovery.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 376b74a0-8809-454f-88b6-7a0b5a4546bc

📥 Commits

Reviewing files that changed from the base of the PR and between 06352db and 4d2fb18.

📒 Files selected for processing (1)
  • cpp/src/utilities/manual_cuda_graph.cuh

@akifcorduk akifcorduk changed the title [DON'T MERGE] Cuda graph side capture Cuda graph replays on capture error May 20, 2026
if (use_graph) {
step_graph_.run(climber_stream, step_body);
} else {
step_body();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's elegant I like it

// caution Binary part is because in pdlp we swap pointers instead of copying vectors to accept a
// valid pdhg step So every odd pdlp step it's one graph, every even step it's another graph
// Two-slot CUDA-graph cache for PDLP. PDLP swaps pointers (rather than
// copying vectors) at the end of every pdhg step, so the captured graph
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just say adaptive pdhg step. There is no pointer swap when using fixed pdhg step (which is what we do for default/Stable3)

Comment thread cpp/src/pdlp/pdhg.cu Outdated
// Currently graph capture is not supported for cuSparse SpMM
// TODO enable once cuSparse SpMM supports graph capture
graph_all{stream_view_, is_legacy_batch_mode || batch_mode_},
graph_all_non_major{stream_view_, is_legacy_batch_mode || batch_mode_},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep a single graph_all and use the is_major to swap between major and non-major like before?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarifying things on slack.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirm this is wrong, reverting back the previous graph_all mecanism

private:
// RAII helper: cleans up a partial capture if the user-supplied callable
// throws between start- and end-capture.
struct capture_guard_t {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool mechanism, I love it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Credits to Claude :)

capture_guard_t guard{stream.value()};

RAFT_CUDA_TRY(cudaStreamBeginCapture(stream.value(), cudaStreamCaptureModeThreadLocal));
guard.capture_active = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is a tiny risk here if there is an error exactly between begin capture and settings capture_active = true

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since the error is contained in this thread, it should be okay. If there is an exception in another thread and is uncaught, it will abort the process anyway.

@rgsl888prabhu rgsl888prabhu changed the base branch from main to release/26.06 May 20, 2026 17:26
Copy link
Copy Markdown
Contributor

@hlinsen hlinsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean solution, thanks @akifcorduk!

Comment thread cpp/src/utilities/manual_cuda_graph.cuh
Comment thread cpp/src/pdlp/pdhg.cu Outdated
// swaps the primal/dual ping-pong buffers between outer pdlp iterations — so the captured
// graph's baked-in pointers depend on `total_pdlp_iterations` parity, not on `should_major`.
// Use a dedicated ping-pong cache per branch and key each on `total_pdlp_iterations` so each
// (branch, parity) pair maps to its own cached executable.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's untrue, there is no update_solution when using reflected. update_solution is only called in the take_adaptive_step while reflected is used in the take_constant_step

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Kh4ster
Copy link
Copy Markdown
Contributor

Kh4ster commented May 21, 2026

/ok to test 035150d

@Kh4ster
Copy link
Copy Markdown
Contributor

Kh4ster commented May 21, 2026

/merge

@rapids-bot rapids-bot Bot merged commit c5f8f44 into NVIDIA:release/26.06 May 21, 2026
98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change P0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Crash on test_incumbent_callbacks (python)

5 participants