[RCCL] Enforce model index matching across nodes (#5377)
## Motivation
RCCL relies on a single, consistent Rome / GIO preset topology model
index across all ranks so precomputed graphs (rings, trees, etc.) stay
in lockstep. If different ranks infer different romeTopoModelIdx,
behavior is undefined and jobs should fail fast with actionable logs.
## Technical Details
Choose the reference index by plurality vote: the index reported by the
most ranks becomes the reference. Ranks whose index differs from the
reference are reported with one log line per physical host.
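The voting and per-host reporting described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this change: `RankInfo`, `Vote`, `pluralityVote`, and `mismatchedPerHost` are hypothetical names, and the tie-breaking rule (smallest index wins) is an assumption, since the commit message does not say how ties are resolved.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-rank record for illustration; not an RCCL type.
struct RankInfo {
  int rank;
  std::string host;
  int modelIdx;  // stands in for the rank's romeTopoModelIdx
};

struct Vote {
  int refIdx;    // index chosen as the reference
  int refVotes;  // number of ranks that reported it
};

// Plurality vote: the index reported by the most ranks wins.
// Assumption: ties go to the smallest index (std::map iterates in
// ascending key order and the comparison below is strict).
Vote pluralityVote(const std::vector<RankInfo>& ranks) {
  std::map<int, int> counts;
  for (const auto& r : ranks) counts[r.modelIdx]++;
  Vote best{-1, 0};
  for (const auto& kv : counts) {
    if (kv.second > best.refVotes) best = Vote{kv.first, kv.second};
  }
  return best;
}

// Returns the first disagreeing rank seen on each host, so that a
// caller can print one line per physical host instead of one per rank.
std::vector<RankInfo> mismatchedPerHost(const std::vector<RankInfo>& ranks,
                                        int refIdx) {
  std::set<std::string> seenHosts;
  std::vector<RankInfo> out;
  for (const auto& r : ranks) {
    if (r.modelIdx != refIdx && seenHosts.insert(r.host).second) {
      out.push_back(r);
    }
  }
  return out;
}
```

With 24 ranks spread over three hosts, where one host reports -1 and the other two report 40, this sketch would pick 40 with 16 votes and return a single entry for the disagreeing host.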
## JIRA ID
## Test Plan
Manually introduce mismatched nodes and check output.
## Test Result
rocm-systems/projects/rccl/build/hipify/src/init.cc:1681 NCCL WARN RCCL
FATAL: mismatched Rome preset topology model index across ranks; all
ranks must agree for precomputed graphs (voted refIdx 40 from 16 of 24
ranks).
rocm-systems/projects/rccl/build/hipify/src/init.cc:1693 NCCL WARN rank
8 host useocpm2m-097-019 romeTopoModelIdx=-1
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
     if (getRomeTopoModelIdx(r) != refIdx) nDisagree++;
+  }
+  if (nDisagree > 0) {
+    WARN("RCCL FATAL: mismatched Rome preset topology model index across ranks; all ranks must agree for precomputed graphs (voted refIdx %d from %d of %d ranks).", refIdx, refVotes, nranks);