Enable DeepSpeed ZeRO-3 for rollout script #38

@XiangZhang-zx


Hi team,

When running sample/llada_rl_rollout.py with DeepSpeed ZeRO-3 together with LoRA and modules_to_save, I hit an out-of-memory (OOM) error.
It seems that the current rollout script either does not fully support ZeRO-3 or needs additional configuration to handle the extra memory footprint introduced by modules_to_save (e.g., keeping wte and ff_out trainable).
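
For context, here is a minimal sketch of the kind of setup described above. It is not the exact configuration used: the LoRA rank, alpha, and target_modules names, the CPU-offload settings, and the vocabulary and hidden sizes are all illustrative assumptions. The dictionary keys follow the keyword arguments of peft.LoraConfig and the DeepSpeed JSON config schema. The helper gives a rough, unsharded estimate of what fully training the modules_to_save layers costs under mixed-precision Adam.

```python
import json

# Hypothetical LoRA settings. The keys match peft.LoraConfig keyword arguments;
# r, lora_alpha, and target_modules are placeholder values (assumptions).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],  # assumed module names
    "modules_to_save": ["wte", "ff_out"],    # full trainable copies, as in this issue
}

# Illustrative DeepSpeed ZeRO-3 config with CPU offload (values are assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def full_train_bytes(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough unsharded memory for fully trained parameters with mixed-precision Adam.

    optim_bytes covers an fp32 master copy plus two fp32 Adam moments (4 + 4 + 4).
    """
    return num_params * (weight_bytes + grad_bytes + optim_bytes)


# Assumed shapes for the embedding (wte) and output projection (ff_out).
vocab_size, hidden_size = 126_464, 4_096
extra_params = 2 * vocab_size * hidden_size

print(json.dumps(ds_config, indent=2))
print(f"modules_to_save extra state: {full_train_bytes(extra_params) / 2**30:.1f} GiB before sharding")
```

Under ZeRO-3 this state would be partitioned across ranks, so the per-GPU share depends on the world size and on whether the rollout path gathers the full parameters at generation time.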

Could you please confirm whether ZeRO-3 is officially supported for rollout (and if so, what the correct setup is)?
If not currently supported, it would be great to include guidance or example configs for using llada_rl_rollout.py with ZeRO-3 in the documentation.
