
Add Qwen3.5 model support (27B dense and 35B-A3B MoE)#1641

Merged
zhuzilin merged 1 commit into main from feature/qwen3_5 on Feb 28, 2026
Conversation


zhuzilin (Contributor) commented on Feb 28, 2026

Note: you need to manually upgrade transformers to 0.5.2 for Qwen3.5 support.

- New model plugin: slime_plugins/models/qwen3_5.py
  - Qwen3_5GatedDeltaNet with separate QKV/Z projections, conv1d, and a flat QKV split
  - get_qwen3_5_spec, which swaps standard attention for linear attention per layer_types (see the dispatch sketch after this list)
- New weight bridge: slime_plugins/mbridge/qwen3_5.py
  - Handles the VLM weight prefix (model.language_model.layers)
  - Fused expert weight format for MoE (3D tensors -> per-expert slices; see the unpacking sketch below)
  - MTP layer support with the individual expert format
- New HF converter: slime/backends/megatron_utils/megatron_to_hf/qwen3_5.py
  - TEGroupedMLP per-expert weight{i} -> HF fused expert format (see the export sketch below)
  - Proper gate/up split for swiglu experts
- Fix sglang_rollout.py: skip the processor for text-only VLM models (see the fallback sketch below)
- Model configs and run scripts for both 27B and 35B-A3B
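
For readers unfamiliar with the hybrid layout, here is a minimal sketch of the per-layer dispatch idea behind get_qwen3_5_spec: each decoder layer gets either a Gated DeltaNet (linear attention) block or a standard softmax-attention block, chosen from the HF config's layer_types list. The builder functions and dict "specs" below are stand-ins for illustration only, not the actual slime_plugins API.

```python
def build_standard_attention_layer(hidden_size: int) -> dict:
    # Stand-in for a softmax-attention layer spec (hypothetical, not Megatron's ModuleSpec).
    return {"attention": "softmax", "hidden_size": hidden_size}


def build_linear_attention_layer(hidden_size: int) -> dict:
    # Stand-in for a Gated DeltaNet layer: separate QKV/Z projections plus a conv1d.
    return {"attention": "gated_delta_net", "hidden_size": hidden_size}


def get_layer_specs(layer_types: list[str], hidden_size: int) -> list[dict]:
    """One spec per decoder layer, chosen from the config's layer_types."""
    return [
        build_linear_attention_layer(hidden_size)
        if layer_type == "linear_attention"
        else build_standard_attention_layer(hidden_size)
        for layer_type in layer_types
    ]


# Example: a hypothetical 4-layer config that mixes both attention kinds.
specs = get_layer_specs(
    ["linear_attention", "linear_attention", "full_attention", "linear_attention"],
    hidden_size=4096,
)
```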
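A minimal sketch of the fused-expert handling on the bridge side: a checkpoint that stores MoE expert weights as one 3D tensor is sliced into per-expert 2D weights, and the VLM-style model.language_model.layers prefix is mapped back to the plain decoder prefix. Shapes and the exact key mapping are assumptions, not the real qwen3_5 bridge code.

```python
import torch


def split_fused_experts(fused: torch.Tensor) -> list[torch.Tensor]:
    """[num_experts, out_features, in_features] -> list of per-expert [out, in] weights."""
    return [fused[i].contiguous() for i in range(fused.shape[0])]


def strip_vlm_prefix(name: str) -> str:
    # The VLM-style checkpoint prefixes decoder weights with
    # "model.language_model.layers"; map them to the plain "model.layers" form.
    return name.replace("model.language_model.layers", "model.layers")


fused = torch.randn(8, 1024, 2048)        # 8 experts stored as one fused tensor
per_expert = split_fused_experts(fused)   # 8 tensors, each [1024, 2048]
print(strip_vlm_prefix("model.language_model.layers.0.self_attn.q_proj.weight"))
```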
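And a sketch of the reverse, export-to-HF direction: per-expert weight{i} tensors from a grouped MLP are stacked back into one fused 3D tensor, and the swiglu fc1 weight is split into its gate and up halves. Again, shapes and stacking order are illustrative assumptions rather than the converter's exact layout.

```python
import torch


def stack_experts(per_expert: list[torch.Tensor]) -> torch.Tensor:
    """List of per-expert [out, in] weights -> fused [num_experts, out, in] tensor."""
    return torch.stack(per_expert, dim=0)


def split_gate_up(fc1_weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """A swiglu fc1 weight holds gate and up stacked along the output dim; split them."""
    gate, up = torch.chunk(fc1_weight, 2, dim=0)
    return gate, up


experts = [torch.randn(2048, 1024) for _ in range(8)]   # hypothetical weight{i} tensors
fused = stack_experts(experts)                           # [8, 2048, 1024]
gate, up = split_gate_up(fused[0])                       # each [1024, 1024]
```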
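The sglang_rollout.py fix amounts to a fallback of this shape: if the checkpoint is a text-only variant of a VLM architecture, skip the processor and load only the tokenizer. The helper name and the is_text_only flag are hypothetical; only the skip-the-processor behavior comes from this PR.

```python
from transformers import AutoProcessor, AutoTokenizer


def load_tokenizer_or_processor(model_path: str, is_text_only: bool):
    # Hypothetical helper: only the "skip the processor for text-only models"
    # branch reflects the fix described above.
    if is_text_only:
        # Text-only checkpoint of a VLM architecture: there is no image
        # processor to load, so fall back to the plain tokenizer.
        return AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    return AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
```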

Tested: both models were verified end-to-end with training (the parallel layouts are summarized in the sketch below).

- 27B: TP=1 SGLang (8 engines), TP=2/PP=2/CP=2 Megatron, logprob_diff=0.017
- 35B-A3B: TP=2 SGLang (4 engines), EP=8 Megatron, logprob_diff=0.012
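
For reference, the GPU arithmetic implied by the verification runs above, as a small hedged sketch (the real run scripts set these values through Megatron/SGLang launch flags, which are not reproduced here):

```python
# Parallel layouts from the verification runs (values taken from the PR text).
dense_27b_megatron = {"tp": 2, "pp": 2, "cp": 2}   # 2 * 2 * 2 = 8 GPUs per replica
moe_35b_a3b_megatron = {"ep": 8}                   # experts sharded across 8 ranks


def gpus_per_replica(layout: dict) -> int:
    """Product of all parallelism degrees in a layout."""
    n = 1
    for degree in layout.values():
        n *= degree
    return n


assert gpus_per_replica(dense_27b_megatron) == 8
assert gpus_per_replica(moe_35b_a3b_megatron) == 8
```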


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhuzilin zhuzilin merged commit 55828f3 into main Feb 28, 2026
2 checks passed
@zhuzilin zhuzilin deleted the feature/qwen3_5 branch February 28, 2026 06:36