
fix - make prepare_cg_spec_decode_kernel easy to use and understand #954

Open

zerozw wants to merge 1 commit into main from fix/fix_trtllm

Conversation

@zerozw
Collaborator

@zerozw zerozw commented Apr 29, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 29, 2026 09:52
@zerozw zerozw requested a review from LLLLKKKK as a code owner April 29, 2026 09:52
@LLLLKKKK
Collaborator

AI Code Review - PR #954

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/1

lgtm ready to ci

Non-blocking Suggestions

P3

  • The test name padding_rows_stay_zero is slightly at odds with the actual semantics @ rtp_llm/models_py/modules/factory/attention/cuda_impl/test/trtllm_gen_test.py:412
    • Suggestion: rename it to test_mode1_padding_rows_become_zero, or add a comment before the assertion explaining that padding rows are explicitly written as 0 (prefix + q_len = 0 + 0).
Review Checklist: 9 pass / 0 fail

General Principles Checklist

Passed

  • [PASS] 6.1 Software Engineering: the change sticks to a single responsibility (the q_len load semantics of the spec-decode kernel) with no unrelated refactoring mixed in
  • [PASS] 6.1 Architecture: the caller contract stays compatible, and the behavior change at the boundary is covered explicitly
  • [PASS] 6.1 Tests: the new logic has targeted unit tests covering the key boundary (padding rows)
  • [PASS] 6.1 Quality: the logic change is not mixed with unrelated formatting, and the docstring is updated in sync with the code semantics

RTP-LLM Checklist

Passed

  • [PASS] B Correctness And Logic: the kernel input loads handle the batch dimension and padding/masking correctly
  • [PASS] D Performance: the change on the performance hot path (CUDA graph prepare) is reasonable
  • [PASS] H Tests And CI: existing and newly added tests are updated together with the kernel behavior change

Python Static-First Checklist

Passed

  • [PASS] P.A Static Correctness: no unused imports, undefined variables, or type inconsistencies
  • [PASS] P.E Tests: test cases are independent, repeatable, and free of external state

Strengths

  • Precisely identifies the root cause: on CUDA-graph padding rows, q_len was broadcast from the first request's q_len, producing a wrong seq_lens_out; the fix switches to a per-row tl.load with mask and other=0, keeping the change minimal (see the sketch after this list).
  • The docstring is updated in sync, stating clearly that valid spec requests share a q_len but padding rows must be read independently; the "why" is explained well, preventing a later accidental rollback.
  • The new test_mode1_padding_rows_stay_zero directly covers the padding-row scenario, and the existing mode1 cases change q_len_tensor from a 1-element to an N-element tensor, matching the new kernel's per-row load contract and avoiding masked loads that read past the end of the tensor.
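
A minimal sketch of the load-semantics change described above, assuming a kernel signature along the lines of prepare_cg_spec_decode_kernel(prefix_ptr, q_len_ptr, seq_lens_out_ptr, N, BLOCK_N); the exact parameters and stores are not shown in this conversation:

    import triton
    import triton.language as tl

    @triton.jit
    def prepare_cg_spec_decode_kernel(prefix_ptr, q_len_ptr, seq_lens_out_ptr,
                                      N, BLOCK_N: tl.constexpr):
        offsets_n = tl.arange(0, BLOCK_N)
        mask_n = offsets_n < N
        # Old behavior (per the review): a scalar load broadcast q_len[0] to every
        # row, so CUDA-graph padding rows (prefix = 0) wrongly became q_len[0]:
        #   q_len = tl.load(q_len_ptr)
        # Fixed behavior: per-row masked loads; padding/out-of-range rows read 0.
        q_len = tl.load(q_len_ptr + offsets_n, mask=mask_n, other=0)
        prefix = tl.load(prefix_ptr + offsets_n, mask=mask_n, other=0)
        tl.store(seq_lens_out_ptr + offsets_n, prefix + q_len, mask=mask_n)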

@LLLLKKKK LLLLKKKK enabled auto-merge (rebase) April 29, 2026 10:02
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@wht21
Collaborator

wht21 commented Apr 30, 2026

internal source has been updated, please review the changes!

1 similar comment

@LLLLKKKK
Collaborator

AI Code Review - PR #954

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Review Checklist: 9 pass / 0 fail

General Principles Checklist

Passed

  • [PASS] 6.1 Software Engineering: the change is focused, contains no unrelated refactoring, and the docstring is updated in sync with the behavior
  • [PASS] 6.1 Architecture: state and invariants remain valid on the padding path
  • [PASS] 6.1 Tests: a targeted regression case is added for the fixed behavior and covers the boundary
  • [PASS] 6.1 Quality: no hot-path debug logging, no mega-PR, the commit stays focused

RTP-LLM Checklist

Passed

  • [PASS] B Correctness & Logic (kernel/cache): the Triton kernel load/store masks and pointer offsets are correct
  • [PASS] D Performance & CUDA Graph: the CUDA-graph preparation path behaves consistently across capture and replay
  • [PASS] H Tests & CI: the added/modified tests run under the existing PrepareCudaGraphKernelTest framework

Python Static-First Checklist

Passed

  • [PASS] P.A Static-first checks: types and tensor construction are correct, no unused variables
  • [PASS] P.B Correctness in Python tests: the assertions cover the fix target and do not depend on external state

Strengths

  • A small, precise fix for a real defect in the CUDA-graph spec-decode preparation kernel: previously tl.load(q_len_ptr) read only q_len[0] and broadcast it to every row, so padding rows (prefix=0) were incorrectly written as q_len[0]. After switching to a masked load over offsets_n, padding rows stay 0.
  • The docstring now explains why a per-row read is needed (CUDA-graph padding rows are zeroed separately), so later readers understand the constraint instead of treating it as unnecessary generalization.
  • The new test_mode1_padding_rows_stay_zero locks in the regression path with prefix=[100,200,0,0] / q_len=[5,5,0,0] (see the sketch after this list); at the same time, test_mode1_base / test_mode1_large_batch change q_len from a single-element tensor to torch.full((N,), ...), consistent with the new kernel read semantics.
  • The caller FlashInferTRTLLMSpecDecodeImpl.prepare_cuda_graph passes attn_inputs.input_lengths_d (sized by batch), so all N masked reads stay in bounds and no new out-of-bounds risk is introduced.
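
A tiny numeric illustration of the padding-row semantics, using the values cited above (hypothetical host-side arithmetic, not the actual test code):

    import torch

    # Two valid spec-decode requests followed by two CUDA-graph padding rows.
    prefix = torch.tensor([100, 200, 0, 0], dtype=torch.int32)
    q_len = torch.tensor([5, 5, 0, 0], dtype=torch.int32)

    # Per-row semantics after the fix: padding rows add 0 + 0 and stay zero.
    assert (prefix + q_len).tolist() == [105, 205, 0, 0]

    # Old broadcast semantics (reading only q_len[0]) would instead yield
    # [105, 205, 5, 5], corrupting the padding rows.
    assert (prefix + q_len[0]).tolist() == [105, 205, 5, 5]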

Copilot AI review requested due to automatic review settings May 6, 2026 02:09
@zerozw zerozw force-pushed the fix/fix_trtllm branch from d3e5d05 to 6830966 Compare May 6, 2026 02:09
@LLLLKKKK
Collaborator

LLLLKKKK commented May 6, 2026

AI Code Review - PR #954

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Checklist ✅ (56 items passed)

Strengths

  • Reading q_len per row fixes the semantic bug where, during CUDA Graph spec-decode preparation, padding rows incorrectly inherited the first request's q_len; the prefix/q_len loads both gain other=0, aligned with the mask
  • The new test_mode1_padding_rows_stay_zero directly covers the regression path where padding rows stay 0, with a precise assertion ([105,205,0,0])
  • The existing test_mode1_base / test_mode1_large_batch now use torch.full((N,), q_len), consistent with the new kernel contract and avoiding the leftover 1-element tensor misuse (see the sketch after this list)
  • The kernel docstring states the motivation for zeroing padding rows separately, helping future maintainers understand why q_len_ptr[0] can no longer be read
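
A sketch of the test-input change described above (tensor names and values are illustrative, not copied from the test file):

    import torch

    N = 4            # batch size in the hypothetical mode1 test
    q_len_value = 5

    # Old input: a single-element tensor, which only works if the kernel
    # broadcasts q_len[0] to every row.
    q_len_tensor_old = torch.tensor([q_len_value], dtype=torch.int32)

    # New input: one entry per row, matching the kernel's per-row masked load
    # and avoiding masked reads past the end of a 1-element tensor.
    q_len_tensor_new = torch.full((N,), q_len_value, dtype=torch.int32)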

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +164 to +165

Old code:

    mask_n = offsets_n < N
    q_len = tl.load(q_len_ptr)
    prefix = tl.load(prefix_ptr + offsets_n, mask=mask_n)

New code:

    mask_n = offsets_n < N
    q_len = tl.load(q_len_ptr + offsets_n, mask=mask_n, other=0)
    prefix = tl.load(prefix_ptr + offsets_n, mask=mask_n, other=0)