Hi team,
When running sample/llada_rl_rollout.py, I hit an out-of-memory (OOM) error when combining DeepSpeed ZeRO-3 with LoRA and modules_to_save.
It seems that the current rollout script may not fully support ZeRO-3, or may need additional configuration to handle the larger memory footprint introduced by modules_to_save (e.g., keeping wte and ff_out fully trainable).
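For context, here is a back-of-envelope estimate of why modules_to_save is so costly: PEFT keeps a full trainable copy of each listed module, so the embedding and output head carry full gradients and optimizer states in addition to the tiny LoRA adapters. The dimensions below are hypothetical placeholders, not LLaDA's actual config:

```python
# Rough estimate of the extra trainable memory added by
# modules_to_save=["wte", "ff_out"] on top of the LoRA adapters.
# NOTE: vocab_size and hidden_size are HYPOTHETICAL values for illustration.
vocab_size = 126_464   # hypothetical vocabulary size
hidden_size = 4096     # hypothetical hidden dimension

# wte (input embedding) and ff_out (output head) are both vocab x hidden.
trainable_params = 2 * vocab_size * hidden_size

# Typical mixed-precision Adam footprint per trainable parameter:
# 2 B bf16 weight + 2 B bf16 grad + 4 B fp32 master + 8 B fp32 Adam m/v = 16 B
bytes_per_param = 16
extra_gib = trainable_params * bytes_per_param / 2**30

print(f"extra trainable params: {trainable_params / 1e6:.0f}M")
print(f"approx. extra memory:   {extra_gib:.1f} GiB (before ZeRO-3 partitioning)")
```

Even under these rough assumptions, the two saved modules alone account for roughly a billion extra trainable parameters, which plausibly explains the OOM unless ZeRO-3 partitions and/or offloads them correctly.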
Could you please confirm whether ZeRO-3 is officially supported for rollout (and if so, what the correct setup is)?
If not currently supported, it would be great to include guidance or example configs for using llada_rl_rollout.py with ZeRO-3 in the documentation.
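In case it helps, this is roughly the kind of ZeRO-3 config I was experimenting with. It is not an official config, just a sketch using standard DeepSpeed zero_optimization options (CPU offload of params and optimizer states) that I assumed would reduce memory pressure:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "stage3_param_persistence_threshold": 1e6,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

If there is a known-good config for llada_rl_rollout.py, I would be happy to test it and report back.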