I'm currently running ORPO training on the Qwen3-4B model with a 40960 context length, on 4x NVIDIA L40 GPUs. But this requires far more memory than the 48 GB available per GPU. I also tried DeepSpeed ZeRO-3 and FSDP2, but with 4 GPUs both end up as dp = 4, which doesn't solve the OOM caused by the long context. Can I run with dp = 1 or dp = 2 on 4 GPUs using ms-swift?
I also tried Megatron, but Megatron doesn't support ORPO.
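For context, here is a sketch of the kind of command I'm launching (the dataset path and some flag values below are placeholders standing in for my real configuration, not the exact run):

```shell
# Hedged sketch of the ORPO launch (assumes the `swift rlhf` entry point;
# <my_preference_dataset> is a placeholder for my actual data)
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type orpo \
    --model Qwen/Qwen3-4B \
    --dataset <my_preference_dataset> \
    --max_length 40960 \
    --torch_dtype bfloat16 \
    --deepspeed zero3
```

With this setup, all 4 GPUs are used as pure data parallelism, so each rank still has to fit the full 40960-token activations.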