PLUGINS/UCX: Do not allow emulated RMA protocols#1610
PLUGINS/UCX: Do not allow emulated RMA protocols#1610tvegas1 wants to merge 1 commit intoai-dynamo:mainfrom
Conversation
|
👋 Hi tvegas1! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
8072d05 to
27d94bc
Compare
📝 WalkthroughWalkthroughDuring nixlUcxContext construction, the UCX configuration now sets ChangesUCX Context Configuration
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/plugins/ucx/ucx_utils.cpp (1)
421-453:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforcement gap: emulation can still be re-enabled despite disable intent
Line 421 uses
config.modify(...), which is skipped whenUCX_PROTO_EMULATION_ENABLEis set, and the laterengine_configloop (Lines 447-453) can reapplyPROTO_EMULATION_ENABLE. That means emulated RMA is still user-overridable, which conflicts with “do not allow emulated RMA protocols.”Consider enforcing this as non-overridable (e.g., reject/ignore
PROTO_EMULATION_ENABLEinengine_configand set it withmodifyAlwaysafter parsing overrides).Suggested hard-enforcement pattern
- config.modify("PROTO_EMULATION_ENABLE", "n"); + // Default now; hard-enforced again after engine_config parsing. + config.modifyAlways("PROTO_EMULATION_ENABLE", "n"); @@ while (std::getline(stream, elem, ',')) { std::string_view elem_view = elem; size_t pos = elem_view.find('='); if (pos != std::string::npos) { - config.modify(elem_view.substr(0, pos), elem_view.substr(pos + 1)); + const auto key = elem_view.substr(0, pos); + if (key == "PROTO_EMULATION_ENABLE") { + NIXL_WARN << "Ignoring engine_config override for PROTO_EMULATION_ENABLE"; + continue; + } + config.modify(key, elem_view.substr(pos + 1)); } } + + // Keep emulated RMA disabled regardless of external overrides. + config.modifyAlways("PROTO_EMULATION_ENABLE", "n");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/plugins/ucx/ucx_utils.cpp` around lines 421 - 453, The current code sets PROTO_EMULATION_ENABLE via config.modify but later parses engine_config and can re-enable it; ensure PROTO_EMULATION_ENABLE cannot be overridden by: when parsing engine_config in the engine_config loop (the std::getline loop that produces elem/elem_view and calls config.modify), detect keys equal to "PROTO_EMULATION_ENABLE" (or equivalent constant) and ignore or reject them, and after the loop call config.modifyAlways("PROTO_EMULATION_ENABLE", "n") (or the config API's always-write method) to hard-enforce the disabled state; reference the config.modify calls, the engine_config parsing loop, and use modifyAlways (or equivalent) to set the final non-overridable value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Around line 421-453: The current code sets PROTO_EMULATION_ENABLE via
config.modify but later parses engine_config and can re-enable it; ensure
PROTO_EMULATION_ENABLE cannot be overridden by: when parsing engine_config in
the engine_config loop (the std::getline loop that produces elem/elem_view and
calls config.modify), detect keys equal to "PROTO_EMULATION_ENABLE" (or
equivalent constant) and ignore or reject them, and after the loop call
config.modifyAlways("PROTO_EMULATION_ENABLE", "n") (or the config API's
always-write method) to hard-enforce the disabled state; reference the
config.modify calls, the engine_config parsing loop, and use modifyAlways (or
equivalent) to set the final non-overridable value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: debaef6b-60f6-42b6-8e20-760296d08c78
📒 Files selected for processing (1)
src/plugins/ucx/ucx_utils.cpp
27d94bc to
1fbe3e9
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/plugins/ucx/ucx_utils.cpp (1)
421-421:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winVerify if
PROTO_EMULATION_ENABLErequires a UCX version guard (similar toRC_GDA_NUM_CHANNELS).A previous reviewer flagged that this should potentially be placed inside the
if (ucpVersion_ >= UCP_VERSION(1, 21))block at Line 427. The current code has no version guard. IfPROTO_EMULATION_ENABLEis only recognized by UCX 1.21+, thenucp_config_modify()will silently ignore it on older versions, meaning emulation would remain enabled without any indication — directly contradicting the PR's stated intent of never allowing emulated protocols.Confirm in the UCX source/changelog which version first introduced
PROTO_EMULATION_ENABLE:#!/bin/bash # Check for PROTO_EMULATION_ENABLE in UCX-related files in the repo rg -n "PROTO_EMULATION" --type cpp rg -n "PROTO_EMULATION" --type cmake rg -n "PROTO_EMULATION" --type txt🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/plugins/ucx/ucx_utils.cpp` at line 421, The code unconditionally sets PROTO_EMULATION_ENABLE via config.modify which may be unknown to older UCX versions and therefore ignored; wrap the config.modify("PROTO_EMULATION_ENABLE", "n") call in the same version guard used for RC_GDA_NUM_CHANNELS so it only runs when ucpVersion_ >= UCP_VERSION(1,21) (i.e., inside the if (ucpVersion_ >= UCP_VERSION(1, 21)) block), and ensure any call path that applies config.modify uses the same ucpVersion_ check before calling ucp_config_modify to guarantee older UCX won’t silently ignore the flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Line 421: The code unconditionally sets PROTO_EMULATION_ENABLE via
config.modify which may be unknown to older UCX versions and therefore ignored;
wrap the config.modify("PROTO_EMULATION_ENABLE", "n") call in the same version
guard used for RC_GDA_NUM_CHANNELS so it only runs when ucpVersion_ >=
UCP_VERSION(1,21) (i.e., inside the if (ucpVersion_ >= UCP_VERSION(1, 21))
block), and ensure any call path that applies config.modify uses the same
ucpVersion_ check before calling ucp_config_modify to guarantee older UCX won’t
silently ignore the flag.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9786496c-7212-4517-bbda-4306addfc8b3
📒 Files selected for processing (1)
src/plugins/ucx/ucx_utils.cpp
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Line 427: The current call config.modify("PROTO_EMULATION_ENABLE", "n") is a
soft enforcement and can be overridden by the UCX_PROTO_EMULATION_ENABLE
environment variable; update the code in ucx_utils.cpp so the enforcement cannot
be bypassed by either (a) replace config.modify(...) with
config.modifyAlways("PROTO_EMULATION_ENABLE", "n") to force the setting, or (b)
before modifying, read getenv("UCX_PROTO_EMULATION_ENABLE") and if it exists and
is not "n" log an error and fail initialization (or explicitly clear it),
ensuring the UCX_PROTO_EMULATION_ENABLE env var cannot re-enable emulation;
reference the existing config object and the
PROTO_EMULATION_ENABLE/UCX_PROTO_EMULATION_ENABLE symbols when making the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e217ca1-2e6d-4a23-8aea-543862d53d29
📒 Files selected for processing (1)
src/plugins/ucx/ucx_utils.cpp
Signed-off-by: Thomas Vegas <tvegas@nvidia.com>
1fbe3e9 to
7b405bf
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/plugins/ucx/ucx_utils.cpp (1)
426-430:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
PROTO_EMULATION_ENABLE(andIB_TX_INLINE_RESP) use soft enforcement, defeating the PR's stated goalBoth new settings at lines 427–428 go through
config.modify(), which (perconfig.cpp:28–31) silently no-ops if the correspondingUCX_*environment variable is already set. ForPROTO_EMULATION_ENABLEthis directly conflicts with the PR objective — "do not allow emulated RMA protocols" — because any pre-setUCX_PROTO_EMULATION_ENABLE=yenvironment variable will cause the call to be ignored without warning. The same argument applies toIB_TX_INLINE_RESPif setting it to"0"is required for correct zero-copy behaviour.Additionally, the engine-config parsing loop (lines 449–455) calls
config.modify()for eachkey=valuepair, so passingPROTO_EMULATION_ENABLE=yinengine_configwould re-enable emulation even without the environment variable.
NET_DEVICES(line 418) usesconfig.modifyAlways()precisely because it must not be silently bypassed; the same rationale applies here.💡 Proposed fix
if (ucpVersion_ >= UCP_VERSION(1, 21)) { - config.modify("IB_TX_INLINE_RESP", "0"); - config.modify("PROTO_EMULATION_ENABLE", "n"); + config.modifyAlways("IB_TX_INLINE_RESP", "0"); + config.modifyAlways("PROTO_EMULATION_ENABLE", "n"); config.modify("RC_GDA_NUM_CHANNELS", std::to_string(num_device_channels)); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/plugins/ucx/ucx_utils.cpp` around lines 426 - 430, The calls to config.modify("IB_TX_INLINE_RESP", "0") and config.modify("PROTO_EMULATION_ENABLE", "n") silently no-op when UCX_* env vars or engine_config override is present, defeating the goal of forbidding emulated RMA; change these to use config.modifyAlways("IB_TX_INLINE_RESP", "0") and config.modifyAlways("PROTO_EMULATION_ENABLE", "n") in ucx_utils.cpp, and update the engine_config parsing loop so that if a key equals "PROTO_EMULATION_ENABLE" or "IB_TX_INLINE_RESP" it calls config.modifyAlways(...) (not config.modify(...)) to prevent re-enabling via engine_config or env vars while leaving other keys unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/plugins/ucx/ucx_utils.cpp`:
- Around line 426-430: The calls to config.modify("IB_TX_INLINE_RESP", "0") and
config.modify("PROTO_EMULATION_ENABLE", "n") silently no-op when UCX_* env vars
or engine_config override is present, defeating the goal of forbidding emulated
RMA; change these to use config.modifyAlways("IB_TX_INLINE_RESP", "0") and
config.modifyAlways("PROTO_EMULATION_ENABLE", "n") in ucx_utils.cpp, and update
the engine_config parsing loop so that if a key equals "PROTO_EMULATION_ENABLE"
or "IB_TX_INLINE_RESP" it calls config.modifyAlways(...) (not
config.modify(...)) to prevent re-enabling via engine_config or env vars while
leaving other keys unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 66173add-09ca-4de1-bdf5-e7e76a4bef69
📒 Files selected for processing (1)
src/plugins/ucx/ucx_utils.cpp
|
/build |
What?
Disable emulated RMA protocols in UCX plugin. An explicit error will be triggered if some non-zcopy protocols would have to be used.
Why?
We want obvious failure when zcopy protocols are not usable.
How?
Leverage added option start in UCX 1.21.
Summary by CodeRabbit