Default DataLoader multiprocessing context to spawn #3520

Open

0o8o0-blip wants to merge 1 commit into huggingface:main from 0o8o0-blip:dataloader-spawn-default

Conversation

@0o8o0-blip commented May 6, 2026

Summary

Make the DataLoader multiprocessing start method configurable on TrainPipelineConfig and default it to "spawn". Previously we fell back to the platform's default start method (fork on Linux), which is unsafe with libraries that hold non-fork-safe state in the parent process; notable offenders here are PyAV, torchcodec, and the ffmpeg shared libs they wrap.

Closes #2488 and addresses one of the worker-failure modes in #2209.
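
For concreteness, a sketch of the shape of the change, assuming the config field name mirrors the CLI flag (the exact lerobot wiring in this PR may differ):

```python
from dataclasses import dataclass

from torch.utils.data import DataLoader


@dataclass
class TrainPipelineConfig:  # sketch; the real config has many more fields
    num_workers: int = 4
    # Mirrors --dataloader-multiprocessing-context. Previously unset, so
    # DataLoader fell back to the platform default (fork on Linux).
    dataloader_multiprocessing_context: str = "spawn"


def make_dataloader(cfg: TrainPipelineConfig, dataset):
    return DataLoader(
        dataset,
        num_workers=cfg.num_workers,
        # DataLoader rejects a multiprocessing context when num_workers == 0,
        # so only pass one when workers are actually used.
        multiprocessing_context=(
            cfg.dataloader_multiprocessing_context if cfg.num_workers > 0 else None
        ),
    )
```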

What breaks today (fork default)

Reproducible on RTX 3060 + Ubuntu 22.04 + torch 2.10 + lerobot main, training ACT on a small video dataset with --num_workers=2:

  • multiprocessing.context.AuthenticationError: digest received was wrong — appears mid-epoch when a forked worker re-uses the parent's multiprocessing.authkey state in a way the parent has invalidated
  • RuntimeError: Pin memory thread exited unexpectedly — pin-memory thread shares fork-time CUDA state with the parent and crashes when reinitializing
  • RuntimeError: DataLoader worker (pid X) exited unexpectedly — typical wrapper for either of the above
  • Random SIGSEGV inside the worker during video decode (PyAV / torchcodec / ffmpeg are not fork-safe); the sketch after this list shows the state-inheritance mechanism
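
The common mechanism behind all four: fork copies the parent's process image, so workers inherit whatever state PyAV/ffmpeg, CUDA, or the authkey machinery held at fork time. A minimal standalone demonstration (plain stdlib, nothing lerobot-specific):

```python
import multiprocessing as mp

# Module-level state that the parent mutates before starting workers.
STATE = {"parent_initialized": False}


def report(method: str) -> None:
    # fork children inherit the parent's mutated STATE; spawn children
    # re-import this module and see a fresh copy.
    print(f"{method}: parent_initialized={STATE['parent_initialized']}")


if __name__ == "__main__":
    STATE["parent_initialized"] = True
    for method in ("fork", "spawn"):  # fork is unavailable on Windows
        ctx = mp.get_context(method)
        proc = ctx.Process(target=report, args=(method,))
        proc.start()
        proc.join()
    # Prints: fork: parent_initialized=True
    #         spawn: parent_initialized=False
```

For a plain dict this inheritance is harmless; for ffmpeg decoder handles, CUDA contexts, or authkey digests it is exactly what kills the worker.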

Empirically: 5/5 launches with the current default crashed before the first checkpoint; after this change, 1/1 ran clean.

Why spawn is safe to default

  • Workers re-import modules cleanly, so PyAV/torchcodec/ffmpeg get fresh state per worker rather than inheriting the parent's
  • Spawn is already the workaround recommended in the existing issues
  • Users who specifically need fork (faster worker startup, smaller memory footprint) can still set --dataloader-multiprocessing-context=fork; the sketch below shows the PyTorch-level knob the flag forwards to
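
The knob itself is stock PyTorch; the config field just threads it through to the DataLoader. A standalone sketch (ToyDataset is a stand-in for a real video dataset):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for a lerobot video dataset."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, i: int) -> torch.Tensor:
        return torch.tensor(i)


if __name__ == "__main__":  # required under spawn
    loader = DataLoader(
        ToyDataset(),
        num_workers=2,
        multiprocessing_context="spawn",  # what the new default selects
        # multiprocessing_context="fork",  # the opt-back-in path
    )
    for batch in loader:
        pass
```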

The startup cost on small datasets is ~5–10 s of extra worker spinup once per run, paid back the first time you'd otherwise have hit one of the failure modes above.

Test plan

  • Existing unit tests pass
  • `lerobot.scripts.lerobot_train --num_workers=2 --policy.type=act ...` reaches first checkpoint without worker errors
  • `--dataloader-multiprocessing-context=fork` still works on platforms where users want it
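
A regression test pinning the default could look like this (hedged sketch: the import path, and that the config is constructible with defaults, are both assumptions):

```python
from lerobot.configs.train import TrainPipelineConfig  # import path assumed


def test_dataloader_mp_context_defaults_to_spawn():
    cfg = TrainPipelineConfig()  # assumes defaults are constructible
    assert cfg.dataloader_multiprocessing_context == "spawn"
```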

Make the DataLoader multiprocessing start method configurable on
TrainPipelineConfig and default it to 'spawn'.

The previous default (fork on Linux) is unsafe with libraries that hold
non-fork-safe state in the parent process — common ones in this codebase
are PyAV, torchcodec, and the ffmpeg shared libs they wrap. Symptoms
reported in huggingface#2488, huggingface#2209, and observed locally include:

- multiprocessing.context.AuthenticationError: digest received was wrong
- RuntimeError: Pin memory thread exited unexpectedly
- RuntimeError: DataLoader worker exited unexpectedly
- Random SIGSEGV inside worker processes during video decode

Switching to spawn re-imports modules cleanly in each worker and
eliminates these failure modes. Added the setting as a config field
rather than hard-coding so users on platforms where fork is preferred
can opt back in via --dataloader-multiprocessing-context=fork.
github-actions bot added the configuration (Problems with configuration files or settings) label on May 6, 2026

Development

Successfully merging this pull request may close these issues.

Multiprocessing leads to decoding error
