Fix concurrent LP exception cleanup#1206
Conversation
|
/ok to test 2af4f87 |
2af4f87 to
4969003
Compare
Join concurrent solver worker threads before rethrowing exceptions so std::thread destructors do not terminate the process. Add a deterministic regression test that exercises a PDLP validation error after concurrent workers start. Co-authored-by: Codex <codex@openai.com>
4969003 to
b806ca3
Compare
|
/ok to test b806ca3 |
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughIn ChangesConcurrent Exception Handling
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
/merge |
PR NVIDIA#1206 (Fix concurrent LP exception cleanup) landed on main while this branch already renamed cuopt::mps_parser -> cuopt::linear_programming::io. The merge auto-resolved the file but the two parser references added by NVIDIA#1206 in the new test (concurrent_pdlp_exception_joins_worker_threads) kept the old namespace. Rename them to match the rest of the file. Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
Brings in upstream's unified OpenMP threading model (PR NVIDIA#1099) and other fixes (NVIDIA#1206 concurrent LP exception cleanup, NVIDIA#1214/NVIDIA#1216 destruction/ capture fixes) while preserving local work on the cut and clique stack. Conflict resolution highlights: - Drop std::future/std::async clique flow; adopt upstream's omp task + omp_atomic_t<bool> signal_extend pattern. - Drop modify_problem parameter from find_initial_cliques (we already removed the code that consumed it); adapt the omp-task call site in branch_and_bound::solve accordingly. - Take upstream's [this, &population] capture for the root-LP CPUFJ improvement callback; the new omp taskwait-before-destruction guarantee makes the prior context-lifetime fix unnecessary. - Take upstream's do_cut_pass refactor of the per-pass LP resolve loop; move our per-pass root_lp_with_cuts publish into do_cut_pass so the benchmark metric is still updated on early exits. - Keep our out-of-line omp_mutex_t definitions in omp_helpers.cpp; the enhanced omp_atomic_t with std::memory_order is taken from upstream.
Join concurrent solver worker threads before rethrowing exceptions so std::thread destructors do not terminate the process. Add a deterministic regression test that exercises a PDLP validation error after concurrent workers start.
Prevents some core dumps and gives a more useful error message for the test_incumbent_callbacks flaky test failure.
Old failure:
```
| Explored | Unexplored | Objective | Bound | IntInf | Depth | Iter/Node | Gap | Time |
terminate called without an active exception
Fatal Python error: Aborted
Current thread 0x0000e727cfdde020 (most recent call first):
File "/pyenv/versions/3.11.15/lib/python3.11/site-packages/cuopt/linear_programming/solver/solver.py", line 98 in Solve
File "/pyenv/versions/3.11.15/lib/python3.11/site-packages/cuopt/utilities/exception_handler.py", line 24 in func
File "/__w/cuopt/cuopt/python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py", line 87 in _run_incumbent_solver_callback
File "/__w/cuopt/cuopt/python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py", line 112 in test_incumbent_get_callback
```
New failure:
```
=================================== FAILURES ===================================
_________________ test_incumbent_get_callback[/mip/swath1.mps] _________________
file_name = '/mip/swath1.mps'
@pytest.mark.parametrize(
"file_name",
[
("/mip/swath1.mps"),
("/mip/neos5-free-bound.mps"),
],
)
def test_incumbent_get_callback(file_name):
> _run_incumbent_solver_callback(file_name, include_set_callback=False)
tests/linear_programming/test_incumbent_callbacks.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/linear_programming/test_incumbent_callbacks.py:87: in _run_incumbent_solver_callback
solution = solver.Solve(data_model_obj, settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/utilities/exception_handler.py:48: in func
raise e
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/utilities/exception_handler.py:24: in func
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/linear_programming/solver/solver.py:98: in Solve
s = solver_wrapper.Solve(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E RuntimeError: CUDA error encountered at: file=/tmp/conda-bld-output/bld/rattler-build_libmps-parser/work/cpp/src/pdlp/utilities/ping_pong_graph.cu line=57: call='cudaStreamEndCapture(stream_view_.value(), &even_graph)', Reason=cudaErrorStreamCaptureInvalidated:operation failed due to a previous error during capture
```
Authors:
- Miles Lubin (https://github.com/mlubin)
Approvers:
- Hugo Linsenmaier (https://github.com/hlinsen)
URL: NVIDIA#1206
Join concurrent solver worker threads before rethrowing exceptions so std::thread destructors do not terminate the process. Add a deterministic regression test that exercises a PDLP validation error after concurrent workers start.
Prevents some core dumps and gives a more useful error message for the test_incumbent_callbacks flaky test failure.
Old failure:
New failure: