feat: add DeepGEMM JIT kernel warmup integrated into C++ engine startup#967
Open
ydshi0 wants to merge 3 commits into alibaba:main from
Conversation
Collaborator
AI Code Review - PR #967 — Status: BLOCKING. Summary: P0/1 · P1/1 · P2/3 · P3/3.
Checklist violations: 12 fail / 84 total (General Principles, RTP-LLM, Python Static-First checklists).
Collaborator
AI Code Review - PR #967 — Status: LGTM (lgtm ready to ci). Summary: P0/0 · P1/0 · P2/3 · P3/2.
Checklist violations: 7 fail / 84 total (General Principles, RTP-LLM, Python Static-First checklists).
Collaborator
AI Code Review - PR #967 — Status: LGTM (lgtm ready to ci). Summary: P0/0 · P1/0 · P2/2 · P3/0.
Checklist violations: 4 fail / 56 total (General Principles, Python Static-First checklists).
- rtp_llm/models_py/kernels/cuda/deepgemm_warmup.py (new):
DeepGEMM JIT warmup implementation — pre-compiles fp8_gemm_nt (dense)
and m_grouped_fp8_gemm_nt_masked (MoE) kernels for representative M values
to avoid first-request nvcc latency. GroupedGemmWarmupWrapper encapsulates
MoE dummy tensor construction behind a forward() interface.
- rtp_llm/models_py/model_desc/module_base.py:
Add warmup_deep_gemm() method to GptModelBase, reads deepgemm_warmup
config and delegates to deepgemm_warmup.deep_gemm_warmup().
- rtp_llm/cpp/normal_engine/NormalEngine.cc:
Call py_model.warmup_deep_gemm() in NormalEngine constructor before
engine warmUp(), so JIT kernels are compiled before fake inference.
- rtp_llm/config/model_config.py:
Add deepgemm_warmup field to ModelConfig (1=enabled, 0=disabled,
full=all M values).
- rtp_llm/models_py/BUILD:
Add visibility for kernels target to allow subpackage test access.
- rtp_llm/models_py/kernels/cuda/test/BUILD + test_deepgemm_warmup.py (new):
Unit tests verifying warmup calls fp8 linear forward() and MoE
m_grouped_fp8_gemm_nt_masked, plus JIT cache file write/reuse.
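The warmup flow described above can be sketched roughly as follows. This is a hypothetical simplification, not the PR's actual implementation: the `gemm_fn` callable stands in for the real DeepGEMM entry points (`fp8_gemm_nt` / `m_grouped_fp8_gemm_nt_masked`) and the fp8 dummy-tensor construction, and `DEFAULT_WARMUP_MS` is an illustrative list of representative M values.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)

# Illustrative set of representative M values to pre-compile for;
# the real list depends on the model and batching configuration.
DEFAULT_WARMUP_MS: List[int] = [1, 16, 64, 256, 1024, 4096, 8192]


class GroupedGemmWarmupWrapper:
    """Hides dummy-tensor construction behind a forward() interface."""

    def __init__(self, gemm_fn: Callable[[int], None]) -> None:
        self._gemm_fn = gemm_fn      # stand-in for the real grouped fp8 GEMM
        self._warmed: set = set()    # M values already JIT-compiled

    def forward(self, m: int) -> None:
        # Compile the kernel for this M once; later calls hit the JIT cache.
        if m in self._warmed:
            return
        self._gemm_fn(m)             # real code builds dummy fp8 tensors here
        self._warmed.add(m)


def deep_gemm_warmup(wrapper: GroupedGemmWarmupWrapper,
                     ms: Iterable[int] = DEFAULT_WARMUP_MS) -> int:
    """Pre-compile kernels for representative M values; returns count visited."""
    count = 0
    for m in ms:
        wrapper.forward(m)
        count += 1
    return count
```

Running the warmup at engine startup moves the one-time nvcc compilation cost out of the first request; a second pass over the same M values is a no-op because every kernel is already in the cache.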
P0/P1 (Blocking): GIL acquisition — NormalEngine.cc:81-90
Problem: The NormalEngine constructor calls Python (py::hasattr, attr(...)(), e.what()) inside LocalRpcServer's gil_scoped_release scope without holding the GIL, which can crash or trigger undefined behavior.
Fix: Insert py::gil_scoped_acquire gil; before any Python call, scoped over the entire if block (including e.what() in the catch), so every Python API call runs under GIL protection.

P2: Gate deepgemm warmup behind the runtime_config.warm_up master switch — NormalEngine.cc:81
Problem: The DeepGEMM warmup ran independently of the runtime_config.warm_up master switch, so it executed even when the user disabled warmup.
Fix: Add runtime_config.warm_up to the outer condition so it matches the engine warmup; deepgemm_warmup=0 remains the fine-grained disable knob.

P2: Eliminate the hard-coded magic number 8192 — module_base.py:113
Problem: The max_tokens cap of 8192 was a bare magic number.
Fix: Extract it as a class constant DEEPGEMM_WARMUP_MAX_TOKENS = 8192 with a comment explaining its purpose.

P2: Validate the deepgemm_warmup value — model_config.py:529 + deepgemm_warmup.py:335-342
Problem: deepgemm_warmup was a bare tri-state string, so a mistyped value fell back silently.
Fix:
- model_config.py: change the type annotation to Literal[0, 1, "full"]
- deepgemm_warmup.py: validate the value at function entry; log a warning on illegal values and fall back to 1

P3: Narrow the exception catch — module_base.py:124
Problem: A bare except Exception swallowed all exceptions, making recoverable and fatal errors indistinguishable.
Fix: Change to except (ImportError, RuntimeError) to catch only the known-possible failures (missing package, CUDA runtime error); other exceptions propagate. Also pass exc_info=True so the full traceback reaches the log.

Unchanged items
- P3 module-level global sets (_FP8_GEMM_NT_WARMED / _GROUPED_GEMM_WARMED): the current process model runs a single model instance, so cross-instance pollution cannot occur; kept as-is.
- P3 BUILD style: a pure code-style issue with no functional impact; not changed for now.
- Checklist getattr → direct attribute access: fixed; now reads self.config.deepgemm_warmup directly.
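The two Python-side fixes above (value validation and the narrowed exception catch) can be sketched together as follows. This is a hedged illustration, not the PR's code: the function names `validate_deepgemm_warmup` and `safe_warmup` are hypothetical, and only the Literal tri-state, the warn-and-fall-back-to-1 behavior, and the (ImportError, RuntimeError) catch come from the review fixes described above.

```python
import logging
from typing import Callable, Literal, Union

logger = logging.getLogger(__name__)

# Tri-state config value: 0 = disabled, 1 = enabled, "full" = all M values.
DeepgemmWarmupMode = Literal[0, 1, "full"]
_VALID_MODES = (0, 1, "full")


def validate_deepgemm_warmup(value: Union[int, str]) -> DeepgemmWarmupMode:
    """Reject illegal values with a warning instead of failing silently."""
    if value not in _VALID_MODES:
        logger.warning("invalid deepgemm_warmup=%r, falling back to 1", value)
        return 1
    return value


def safe_warmup(warmup_fn: Callable[[], None]) -> bool:
    """Run warmup, catching only known-recoverable failures."""
    try:
        warmup_fn()
        return True
    except (ImportError, RuntimeError):
        # Missing package or CUDA runtime error: log the full traceback
        # and continue startup. Anything else propagates to the caller.
        logger.warning("deepgemm warmup skipped", exc_info=True)
        return False
```

Narrowing the catch keeps genuinely unexpected errors visible at startup, while a missing deep_gemm package merely downgrades to cold-start JIT compilation.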
Force-pushed from 90bc27d to a8c8256
Collaborator
AI Code Review - PR #967 — Status: LGTM (lgtm ready to ci). Summary: P0/0 · P1/0 · P2/3 · P3/0.
Checklist violations: 6 fail / 56 total (General Principles, Python Static-First checklists).
Force-pushed from a8c8256 to a28b3a1
Collaborator
AI Code Review - PR #967 — Status: LGTM (lgtm ready to ci). Summary: P0/0 · P1/0 · P2/1 · P3/0.
Checklist violations: 6 fail / 56 total (General Principles, Python Static-First checklists).