
Handle Long Context in Multi-GPU #8382

@jeff4700

Description


I'm currently training the Qwen3-4B model with ORPO at a context length of 40960, on 4x NVIDIA L40 GPUs. This requires far more memory than the 48 GB available per GPU. I also tried DeepSpeed ZeRO-3 and FSDP2, but with 4 GPUs this becomes dp = 4, which doesn't solve the OOM caused by the long context. Can I run with dp = 1 or dp = 2 on 4 GPUs with ms-swift?

I also tried Megatron, but Megatron doesn't support ORPO.
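For concreteness, here is roughly the launch I have in mind, assuming ms-swift's `sequence_parallel_size` argument can be combined with `swift rlhf --rlhf_type orpo` (I have not verified this combination):

```bash
# Sketch: split 4 GPUs into dp=2 x sp=2 so that each 40960-token
# sequence is sharded across 2 GPUs instead of replicated on all 4.
# --sequence_parallel_size is assumed to apply to `swift rlhf`;
# whether it works together with ORPO is exactly my question.
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type orpo \
    --model Qwen/Qwen3-4B \
    --max_length 40960 \
    --sequence_parallel_size 2 \
    --deepspeed zero3 \
    --gradient_checkpointing true
```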
