I'm currently running ORPO training on the Qwen3-4B model with a 40960 context length, on 4x NVIDIA L40 GPUs. But this requires far more memory than the 48 GB available per GPU. I also tried DeepSpeed ZeRO-3 and FSDP2, but with 4 GPUs both end up as dp = 4, which doesn't solve the OOM caused by the long context. Can I run with dp = 1 or dp = 2 on 4 GPUs using ms-swift?
I also tried Megatron, but Megatron doesn't support ORPO.
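For context, here is a sketch of the kind of command I'm launching (the dataset path and some flag values below are placeholders standing in for my real configuration, not the exact run):

```shell
# Hedged sketch of the ORPO launch (assumes the `swift rlhf` entry point;
# <my_preference_dataset> is a placeholder for my actual data)
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type orpo \
    --model Qwen/Qwen3-4B \
    --dataset <my_preference_dataset> \
    --max_length 40960 \
    --torch_dtype bfloat16 \
    --deepspeed zero3
```

With this setup, all 4 GPUs are used as pure data parallelism, so each rank still has to fit the full 40960-token activations.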