Fix CUDA Graph Capture/Sync race#1250
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughBarrier-thread initialization in the concurrent PDLP solver is reordered: the barrier-side raft::handle_t is created on the main thread, cuDSS warmup is run, and a promise/future readiness handshake is added so the main thread waits before starting PDLP CUDA graph capture. Separately, swath1 incumbent-callback test cases are no longer skipped. ChangesCUDA Graph Capture Synchronization
Incumbent callback tests
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/pdlp/solve.cu (1)
1545-1563:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard the pre-capture setup with thread cleanup.
dual_simplex_threadcan already be running by the timebarrier_handle_ptrconstruction andwarmup_cudss()execute. If any of that setup throws, unwinding will destroy a joinablestd::threadand hard-abort viastd::terminate()before the error can propagate cleanly. Please either defer spawning until all throwing setup is done, or add an RAII joiner/scope guard for both worker threads. As per coding guidelines, "Flag missing RAII in exception paths since cuOpt uses exceptions" and "Flag deadlock potential in resource acquisition patterns".Also applies to: 1616-1630
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 1545 - 1563, The code spawns dual_simplex_thread before later setup (barrier_handle_ptr construction and warmup_cudss()) that may throw, risking std::terminate on unwind; fix by moving thread creation until after all potentially-throwing setup completes or by adding an RAII thread-join guard to ensure any joinable std::thread is joined in destructors. Concretely: either defer launching dual_simplex_thread (and the analogous other worker thread) until after barrier_handle_ptr is constructed and warmup_cudss() returns, or introduce a small scope guard/type (e.g., ThreadJoinGuard) that takes references to dual_simplex_thread (and the other thread) and joins in its destructor so exceptions during subsequent setup won’t abort; ensure run_dual_simplex_thread and dual_simplex_thread usage remains unchanged aside from creation timing or guarding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1545-1563: The code spawns dual_simplex_thread before later setup
(barrier_handle_ptr construction and warmup_cudss()) that may throw, risking
std::terminate on unwind; fix by moving thread creation until after all
potentially-throwing setup completes or by adding an RAII thread-join guard to
ensure any joinable std::thread is joined in destructors. Concretely: either
defer launching dual_simplex_thread (and the analogous other worker thread)
until after barrier_handle_ptr is constructed and warmup_cudss() returns, or
introduce a small scope guard/type (e.g., ThreadJoinGuard) that takes references
to dual_simplex_thread (and the other thread) and joins in its destructor so
exceptions during subsequent setup won’t abort; ensure run_dual_simplex_thread
and dual_simplex_thread usage remains unchanged aside from creation timing or
guarding.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 79004054-042e-47a2-9469-038774f1a555
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
|
Please also re-enable the C++ tests in cpp/tests/mip/incumbent_callback_test.cu (search "swath1 is temporarily disabled"). |
| raft::device_setter device_setter(1); | ||
| CUOPT_LOG_DEBUG("Barrier device: %d", device_setter.get_current_device()); | ||
| barrier_handle_ptr = std::make_unique<raft::handle_t>(); | ||
| warmup_cudss(); |
There was a problem hiding this comment.
I think we may need to explicitly call get_cublas_handle etc. Raft does lazy init https://github.com/rapidsai/raft/blob/main/cpp/include/raft/core/resource/cublas_handle.hpp#L62
There was a problem hiding this comment.
Hmm, good catch Hugo! I wonder how many of those lazy inits we have. I think this will get obfuscated if we need to call check the lazy inits of every library call. I am working on a permanent fix, if that works I will close this PR and we don't need to do this warm start for each library resource.
There was a problem hiding this comment.
I am trying this: #1253 if this works, we don't need this PR and it is a permanent fix.
|
Closing the PR because of the permanent fix another PR: #1253 |
This is a permanent fix to cuda graph capture issue. We add a small RAII wrapper around `cudaStreamBeginCapture` / `cudaStreamEndCapture` that detects `cudaErrorStreamCaptureInvalidated` at EndCapture time, drops the (never-issued) partial graph, re-runs the callable eagerly so the current iteration still produces correct results, and stays uninitialized so the next call retries capture. One extra eager pass instead of a crash. Closing the other PR:#1250 Fixes #1185 Authors: - Akif ÇÖRDÜK (https://github.com/akifcorduk) - Nicolas Blin (https://github.com/Kh4ster) Approvers: - Hugo Linsenmaier (https://github.com/hlinsen) - Ramakrishnap (https://github.com/rgsl888prabhu) URL: #1253
run_concurrent in cpp/src/pdlp/solve.cu constructed the barrier raft::handle_t inside the spawned barrier std::thread, concurrently with PDLP's graph capture on the main thread. First-use init of cuBLAS / cuSPARSE / cuSolverDn / cuDSS triggers synchronous cudaMalloc / cudaDeviceSynchronize under the hood, which invalidates the in-flight capture.
We now construct barrier_handle_ptr on the main thread. Init cudssCreate+cudssDestroy on the main thread to force the CUDA driver's lazy cuDSS module load to settle outside the capture window.
std::promise handshake so run_pdlp only starts after the barrier worker is scheduled. Promise is set unconditionally (including in the catch block) so the main thread cannot deadlock.
Closes #1185