Cuda graph replays on capture error by akifcorduk · Pull Request #1253 · NVIDIA/cuopt

akifcorduk · 2026-05-20T02:15:53Z

This is a permanent fix to cuda graph capture issue. We add a small RAII wrapper around cudaStreamBeginCapture / cudaStreamEndCapture that detects cudaErrorStreamCaptureInvalidated at EndCapture time, drops the (never-issued) partial graph, re-runs the callable eagerly so the current iteration still produces correct results, and stays uninitialized so the next call retries capture. One extra eager pass instead of a crash.

Closing the other PR:#1250
Fixes #1185

coderabbitai · 2026-05-20T02:19:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Add a lazy manual_cuda_graph_t wrapper and migrate manual CUDA-graph capture/instantiate/launch call sites (PDHG, feasibility jump, ping-pong graph, weighted-average, adaptive step-size) to its run(stream, work) API; re-enable previously skipped incumbent callback tests.

Changes

CUDA Graph Abstraction and Solver Integration

Layer / File(s)	Summary
manual_cuda_graph_t wrapper foundation `cpp/src/utilities/manual_cuda_graph.cuh`	New `manual_cuda_graph_t` with lazy graph capture/instantiate/launch via `run(stream, work)`, capture-invalidation recovery, RAII capture guard, move semantics, `is_initialized()` and `reset()` methods.
ping_pong_graph modernization `cpp/src/pdlp/utilities/ping_pong_graph.cuh`, `cpp/src/pdlp/utilities/ping_pong_graph.cu`	Refactor to two-slot cache using `manual_cuda_graph_t` (`even_graph_`, `odd_graph_`) and single `run(total_pdlp_iterations, work)` API; remove explicit lifecycle methods and associated .cu implementations.
Feasibility jump MIP heuristic CUDA graph wrapper `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cuh`, `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`	Replace `graph_instance`/`graph_created` with `manual_cuda_graph_t step_graph_`; move kernel-arg buffers and CUB probe to shared scope; wrap per-iteration kernel sequence in `step_body` and execute via `step_graph_.run(...)` or directly; `reset_cuda_graph` calls `step_graph_.reset()`.
PDHG solver primal/dual step refactoring `cpp/src/pdlp/pdhg.cu`, `cpp/src/pdlp/pdhg.hpp`, `cpp/src/pdlp/pdlp.cu`	Replace manual capture/instantiate/launch control with `graph_all.run(...)`, new `graph_all_non_major.run(...)`, and `graph_prim_proj_gradient_dual.run(...)`; add accessor and member for non-major ping-pong cache; reset both caches after SpMM resize.
Utility components optimization `cpp/src/pdlp/restart_strategy/weighted_average_solution.cu`, `cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu`	Replace conditional capture/instantiate/launch flows with `graph.run(total_pdlp_iterations, [&]{ ... })` wrappers enclosing the same device operations.
Incumbent callback test re-enablement `cpp/tests/mip/incumbent_callback_test.cu`, `python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py`	Add `swath1.mps` back into C++ test parameter lists and remove Python pytest skip marker so `swath1` cases run normally.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/cuopt#1208: Both PRs modify the same incumbent callback tests to change whether the swath1.mps instance runs (main PR un-skips/adds it, retrieved PR temporarily disables/removes it), so the changes directly conflict at the test-parameter level.

Suggested reviewers

tmckayus
chris-maes

Suggested labels

pdlp

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Cuda graph replays on capture error' directly describes the main change: adding resilience to CUDA graph capture by replaying on capture errors.
Description check	✅ Passed	The description is clearly related to the changeset, explaining the RAII wrapper mechanism and error handling strategy introduced in the PR.
Linked Issues check	✅ Passed	The PR fully addresses issue `#1185` by implementing a permanent fix for CUDA graph capture failures, enabling graceful replay on cudaErrorStreamCaptureInvalidated.
Out of Scope Changes check	✅ Passed	All changes are directly related to the CUDA graph capture error handling objective: new manual_cuda_graph_t wrapper, refactored graph-execution APIs in ping_pong_graph_t and related components, and test updates to verify the fix.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cpp/src/utilities/manual_cuda_graph.cuh (2)

136-145: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Double-free of CUDA graph handle when dummy == parent.

With cudaStreamBeginCaptureToGraph, the captured graph returned by cudaStreamEndCapture is the same handle as the parent graph passed in. If user code throws mid-capture and capture wasn't invalidated, dummy == parent, so line 142 destroys it, then line 144 destroys the same handle again.

🐛 Proposed fix: skip destroying dummy since it aliases parent

     ~capture_guard_t() noexcept
     {
       if (capture_active) {
         cudaGraph_t dummy = nullptr;
         // best-effort; we're already unwinding
         cudaStreamEndCapture(stream, &dummy);
-        if (dummy != nullptr) { cudaGraphDestroy(dummy); }
+        // dummy == parent for manual capture; destroying parent below handles it
       }
-      if (parent != nullptr) { cudaGraphDestroy(parent); }
+      if (parent != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(parent)); }
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 136 - 145, The
destructor ~capture_guard_t() can double-destroy the same cudaGraph_t when
cudaStreamEndCapture returns the same handle as parent; modify the cleanup so
that after calling cudaStreamEndCapture(stream, &dummy) you only call
cudaGraphDestroy on dummy if dummy != parent (or alternatively only destroy
parent and skip destroying dummy when they alias), ensuring you still destroy
parent if it is non-null and avoid calling cudaGraphDestroy twice on the same
handle (refer to symbols: ~capture_guard_t, capture_active,
cudaStreamEndCapture, dummy, parent, cudaGraphDestroy).

84-85: ⚠️ Potential issue | 🔴 Critical

cudaStreamBeginCaptureToGraph requires CUDA 12.3+ but the project supports CUDA 12.0+.

This API was introduced in CUDA 12.3. The unconditional use at lines 84–85 will cause build failures on CUDA 12.0, 12.1, and 12.2. Either add a version guard with a fallback implementation or update the project's minimum CUDA requirement to 12.3.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 84 - 85, The call to
cudaStreamBeginCaptureToGraph inside manual_cuda_graph.cuh (wrapped by
RAFT_CUDA_TRY) requires CUDA 12.3+, but the project supports CUDA 12.0–12.2;
guard the call with a CUDA version check (e.g., `#if` defined(CUDA_VERSION) &&
CUDA_VERSION >= 12030) and provide a fallback for older toolkits (call the older
cudaStreamBeginCapture API or cudaStreamBeginCapture(stream.value(),
cudaStreamCaptureModeThreadLocal) within the RAFT_CUDA_TRY) so builds on CUDA
12.0–12.2 use the compatible capture API; ensure both branches use the same
error handling macro (RAFT_CUDA_TRY) and keep the unique symbol names
(cudaStreamBeginCaptureToGraph, cudaStreamBeginCapture, stream.value(),
RAFT_CUDA_TRY) to locate and implement the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 136-145: The destructor ~capture_guard_t() can double-destroy the
same cudaGraph_t when cudaStreamEndCapture returns the same handle as parent;
modify the cleanup so that after calling cudaStreamEndCapture(stream, &dummy)
you only call cudaGraphDestroy on dummy if dummy != parent (or alternatively
only destroy parent and skip destroying dummy when they alias), ensuring you
still destroy parent if it is non-null and avoid calling cudaGraphDestroy twice
on the same handle (refer to symbols: ~capture_guard_t, capture_active,
cudaStreamEndCapture, dummy, parent, cudaGraphDestroy).
- Around line 84-85: The call to cudaStreamBeginCaptureToGraph inside
manual_cuda_graph.cuh (wrapped by RAFT_CUDA_TRY) requires CUDA 12.3+, but the
project supports CUDA 12.0–12.2; guard the call with a CUDA version check (e.g.,
`#if` defined(CUDA_VERSION) && CUDA_VERSION >= 12030) and provide a fallback for
older toolkits (call the older cudaStreamBeginCapture API or
cudaStreamBeginCapture(stream.value(), cudaStreamCaptureModeThreadLocal) within
the RAFT_CUDA_TRY) so builds on CUDA 12.0–12.2 use the compatible capture API;
ensure both branches use the same error handling macro (RAFT_CUDA_TRY) and keep
the unique symbol names (cudaStreamBeginCaptureToGraph, cudaStreamBeginCapture,
stream.value(), RAFT_CUDA_TRY) to locate and implement the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4efb04c-5af5-4c02-baa4-c8ea9d79fa41

📥 Commits

Reviewing files that changed from the base of the PR and between 0a5149b and 06352db.

📒 Files selected for processing (1)

cpp/src/utilities/manual_cuda_graph.cuh

…apture

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/src/utilities/manual_cuda_graph.cuh (1)

76-92: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Destroy the captured graph on instantiation failure.

After a successful cudaStreamEndCapture, captured owns a graph. If cudaGraphInstantiate fails, the throwing macro skips the cudaGraphDestroy below, so each failed first-run/retry leaks a graph handle.

♻️ Proposed fix

     cudaGraph_t captured = nullptr;
     cudaError_t end_err  = cudaStreamEndCapture(stream.value(), &captured);
     guard.capture_active = false;
@@
-    RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured));
-    RAFT_CUDA_TRY(cudaGraphDestroy(captured));
+    try {
+      RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured));
+      RAFT_CUDA_TRY(cudaGraphDestroy(captured));
+      captured = nullptr;
+    } catch (...) {
+      if (captured != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaGraphDestroy(captured)); }
+      throw;
+    }

As per coding guidelines, "Flag missing RAII in exception paths since cuOpt uses exceptions."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 76 - 92, After
cudaStreamEndCapture succeeds and before calling
cudaGraphInstantiate(&instance_, captured), ensure that on instantiate failure
the captured graph is destroyed to avoid leaking the graph handle; replace or
wrap the RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured)) call so that
if cudaGraphInstantiate returns an error you call cudaGraphDestroy(captured) and
then propagate the error (or rethrow) rather than letting the macro skip the
destroy; reference symbols: captured, cudaGraphInstantiate, cudaGraphDestroy,
RAFT_CUDA_TRY, and instance_.

🧹 Nitpick comments (1)

cpp/src/utilities/manual_cuda_graph.cuh (1)

23-34: ⚡ Quick win

Document that invalidation recovery replays work() on the host.

The fallback path preserves device-side results, but any host-side mutation inside work has already happened once during the failed capture attempt and will happen again here. Please call out that work must be host-idempotent, or keep host bookkeeping outside the callable.

📝 Suggested clarification

 // Wrapper around a CUDA graph captured from a callable. CUB / Thrust / RAFT /
 // cuSPARSE calls inside the captured region are preserved.
+// `work` must not perform non-idempotent host-side mutations: if capture is
+// invalidated, it is executed once during the failed capture attempt and once
+// again in the eager fallback path.
 //
 // Invalidation recovery:

Also applies to: 80-87

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/utilities/manual_cuda_graph.cuh` around lines 23 - 34, Add a sentence
to the invalidation recovery doc block explaining that when cudaStreamEndCapture
returns cudaErrorStreamCaptureInvalidated the wrapper drains the sticky error
and re-executes the provided callable (work()) on the host, so any host-side
mutations inside work() will run twice; update text near the description of
cudaStreamEndCapture, work(), and run to state that work must be host-idempotent
or that host bookkeeping should be moved out of the callable to avoid double
application during recovery.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 76-92: After cudaStreamEndCapture succeeds and before calling
cudaGraphInstantiate(&instance_, captured), ensure that on instantiate failure
the captured graph is destroyed to avoid leaking the graph handle; replace or
wrap the RAFT_CUDA_TRY(cudaGraphInstantiate(&instance_, captured)) call so that
if cudaGraphInstantiate returns an error you call cudaGraphDestroy(captured) and
then propagate the error (or rethrow) rather than letting the macro skip the
destroy; reference symbols: captured, cudaGraphInstantiate, cudaGraphDestroy,
RAFT_CUDA_TRY, and instance_.

---

Nitpick comments:
In `@cpp/src/utilities/manual_cuda_graph.cuh`:
- Around line 23-34: Add a sentence to the invalidation recovery doc block
explaining that when cudaStreamEndCapture returns
cudaErrorStreamCaptureInvalidated the wrapper drains the sticky error and
re-executes the provided callable (work()) on the host, so any host-side
mutations inside work() will run twice; update text near the description of
cudaStreamEndCapture, work(), and run to state that work must be host-idempotent
or that host bookkeeping should be moved out of the callable to avoid double
application during recovery.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 376b74a0-8809-454f-88b6-7a0b5a4546bc

📥 Commits

Reviewing files that changed from the base of the PR and between 06352db and 4d2fb18.

📒 Files selected for processing (1)

cpp/src/utilities/manual_cuda_graph.cuh

Kh4ster · 2026-05-20T14:02:25Z

+  if (use_graph) {
+    step_graph_.run(climber_stream, step_body);
+  } else {
+    step_body();


That's elegant I like it

Kh4ster · 2026-05-20T14:47:21Z

-// caution Binary part is because in pdlp we swap pointers instead of copying vectors to accept a
-// valid pdhg step So every odd pdlp step it's one graph, every even step it's another graph
+// Two-slot CUDA-graph cache for PDLP. PDLP swaps pointers (rather than
+// copying vectors) at the end of every pdhg step, so the captured graph


I would just say adaptive pdhg step. There is no pointer swap when using fixed pdhg step (which is what we do for default/Stable3)

Kh4ster · 2026-05-20T14:48:32Z

    // Currently graph capture is not supported for cuSparse SpMM
    // TODO enable once cuSparse SpMM supports graph capture
    graph_all{stream_view_, is_legacy_batch_mode || batch_mode_},
+    graph_all_non_major{stream_view_, is_legacy_batch_mode || batch_mode_},


Why not keep a single graph_all and use the is_major to swap between major and non-major like before?

Clarifying things on slack.

Confirm this is wrong, reverting back the previous graph_all mecanism

Kh4ster · 2026-05-20T14:54:38Z

+ private:
+  // RAII helper: cleans up a partial capture if the user-supplied callable
+  // throws between start- and end-capture.
+  struct capture_guard_t {


Very cool mechanism, I love it

Credits to Claude :)

Kh4ster · 2026-05-20T14:55:49Z

+    capture_guard_t guard{stream.value()};
+
+    RAFT_CUDA_TRY(cudaStreamBeginCapture(stream.value(), cudaStreamCaptureModeThreadLocal));
+    guard.capture_active = true;


I guess there is a tiny risk here if there is an error exactly between begin capture and settings capture_active = true

I think since the error is contained in this thread, it should be okay. If there is an exception in another thread and is uncaught, it will abort the process anyway.

hlinsen

Very clean solution, thanks @akifcorduk!

Kh4ster · 2026-05-21T08:24:58Z

+  // swaps the primal/dual ping-pong buffers between outer pdlp iterations — so the captured
+  // graph's baked-in pointers depend on `total_pdlp_iterations` parity, not on `should_major`.
+  // Use a dedicated ping-pong cache per branch and key each on `total_pdlp_iterations` so each
+  // (branch, parity) pair maps to its own cached executable.


That's untrue, there is no update_solution when using reflected. update_solution is only called in the take_adaptive_step while reflected is used in the take_constant_step

copy-pr-bot · 2026-05-21T09:45:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Kh4ster · 2026-05-21T09:46:09Z

/ok to test 035150d

Kh4ster · 2026-05-21T11:43:06Z

/merge

akifcorduk added 2 commits May 19, 2026 22:05

add cuda error recovery for capture

8ad61dd

test CI

0a5149b

akifcorduk added this to the 26.06 milestone May 20, 2026

akifcorduk requested a review from a team as a code owner May 20, 2026 02:15

akifcorduk added the bug Something isn't working label May 20, 2026

akifcorduk requested a review from a team as a code owner May 20, 2026 02:15

akifcorduk added the non-breaking Introduces a non-breaking change label May 20, 2026

akifcorduk requested review from chris-maes, hlinsen and tmckayus May 20, 2026 02:15

akifcorduk mentioned this pull request May 20, 2026

Fix CUDA Graph Capture/Sync race #1250

Closed

fix logger

06352db

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Merge branch 'main' of github.com:NVIDIA/cuopt into cuda_graph_side_c…

36e74d2

…apture

anandhkb added the P0 label May 20, 2026

restore the api and use api suitable for <12.3

4d2fb18

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

akifcorduk changed the title ~~[DON'T MERGE] Cuda graph side capture~~ Cuda graph replays on capture error May 20, 2026

akifcorduk added 2 commits May 20, 2026 07:05

more comments

dba39c8

fix ping pong graph major, non-major logic

56b6e84

Kh4ster reviewed May 20, 2026

View reviewed changes

rgsl888prabhu changed the base branch from main to release/26.06 May 20, 2026 17:26

hlinsen approved these changes May 20, 2026

View reviewed changes

Comment thread cpp/src/utilities/manual_cuda_graph.cuh

comment change

d8dce3c

rgsl888prabhu approved these changes May 20, 2026

View reviewed changes

Kh4ster reviewed May 21, 2026

View reviewed changes

revert back the graph_major change to go back to initial behavior

f753aa5

Merge branch 'release/26.06' into cuda_graph_side_capture

035150d

Kh4ster removed request for chris-maes and tmckayus May 21, 2026 10:21

Kh4ster assigned Kh4ster and akifcorduk May 21, 2026

rapids-bot Bot merged commit c5f8f44 into NVIDIA:release/26.06 May 21, 2026
98 checks passed

Conversation

akifcorduk commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hlinsen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

Kh4ster commented May 21, 2026

Uh oh!

Kh4ster commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

akifcorduk commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading