Default DataLoader multiprocessing context to spawn #3520

Open

0o8o0-blip wants to merge 1 commit into huggingface:main from 0o8o0-blip:dataloader-spawn-default

Conversation

@0o8o0-blip commented May 6, 2026

Summary

Make the DataLoader multiprocessing start method configurable on TrainPipelineConfig and default it to "spawn". Previously we fell back to the platform's default start method (fork on Linux), which is unsafe with libraries that hold non-fork-safe state in the parent process; notable offenders here are PyAV, torchcodec, and the ffmpeg shared libs they wrap.

Closes #2488 and addresses one of the worker-failure modes in #2209.
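
For concreteness, a sketch of the shape of the change, assuming the config field name mirrors the CLI flag (the exact lerobot wiring in this PR may differ):

```python
from dataclasses import dataclass

from torch.utils.data import DataLoader


@dataclass
class TrainPipelineConfig:  # sketch; the real config has many more fields
    num_workers: int = 4
    # Mirrors --dataloader-multiprocessing-context. Previously unset, so
    # DataLoader fell back to the platform default (fork on Linux).
    dataloader_multiprocessing_context: str = "spawn"


def make_dataloader(cfg: TrainPipelineConfig, dataset):
    return DataLoader(
        dataset,
        num_workers=cfg.num_workers,
        # DataLoader rejects a multiprocessing context when num_workers == 0,
        # so only pass one when workers are actually used.
        multiprocessing_context=(
            cfg.dataloader_multiprocessing_context if cfg.num_workers > 0 else None
        ),
    )
```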

What breaks today (fork default)

Reproducible on RTX 3060 + Ubuntu 22.04 + torch 2.10 + lerobot main, training ACT on a small video dataset with --num_workers=2:

  • multiprocessing.context.AuthenticationError: digest received was wrong — appears mid-epoch when a forked worker re-uses the parent's multiprocessing.authkey state in a way the parent has invalidated
  • RuntimeError: Pin memory thread exited unexpectedly — pin-memory thread shares fork-time CUDA state with the parent and crashes when reinitializing
  • RuntimeError: DataLoader worker (pid X) exited unexpectedly — typical wrapper for either of the above
  • Random SIGSEGV inside the worker during video decode (PyAV / torchcodec / ffmpeg are not fork-safe); the sketch after this list shows the state-inheritance mechanism
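
The common mechanism behind all four: fork copies the parent's process image, so workers inherit whatever state PyAV/ffmpeg, CUDA, or the authkey machinery held at fork time. A minimal standalone demonstration (plain stdlib, nothing lerobot-specific):

```python
import multiprocessing as mp

# Module-level state that the parent mutates before starting workers.
STATE = {"parent_initialized": False}


def report(method: str) -> None:
    # fork children inherit the parent's mutated STATE; spawn children
    # re-import this module and see a fresh copy.
    print(f"{method}: parent_initialized={STATE['parent_initialized']}")


if __name__ == "__main__":
    STATE["parent_initialized"] = True
    for method in ("fork", "spawn"):  # fork is unavailable on Windows
        ctx = mp.get_context(method)
        proc = ctx.Process(target=report, args=(method,))
        proc.start()
        proc.join()
    # Prints: fork: parent_initialized=True
    #         spawn: parent_initialized=False
```

For a plain dict this inheritance is harmless; for ffmpeg decoder handles, CUDA contexts, or authkey digests it is exactly what kills the worker.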

Empirically: 5/5 launches with the current default crashed before the first checkpoint; after this change, 1/1 ran clean.

Why spawn is safe to default

  • Workers re-import modules cleanly, so PyAV/torchcodec/ffmpeg get fresh state per worker rather than inheriting the parent's
  • Spawn is already the workaround recommended in the existing issues
  • Users who specifically need fork (faster worker startup, smaller memory footprint) can still set --dataloader-multiprocessing-context=fork; the sketch below shows the PyTorch-level knob the flag forwards to
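
The knob itself is stock PyTorch; the config field just threads it through to the DataLoader. A standalone sketch (ToyDataset is a stand-in for a real video dataset):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for a lerobot video dataset."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, i: int) -> torch.Tensor:
        return torch.tensor(i)


if __name__ == "__main__":  # required under spawn
    loader = DataLoader(
        ToyDataset(),
        num_workers=2,
        multiprocessing_context="spawn",  # what the new default selects
        # multiprocessing_context="fork",  # the opt-back-in path
    )
    for batch in loader:
        pass
```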

The startup cost on small datasets is ~5–10 s of extra worker spinup once per run, paid back the first time you'd otherwise have hit one of the failure modes above.

Test plan

  • Existing unit tests pass
  • `lerobot.scripts.lerobot_train --num_workers=2 --policy.type=act ...` reaches first checkpoint without worker errors
  • `--dataloader-multiprocessing-context=fork` still works on platforms where users want it
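
A regression test pinning the default could look like this (hedged sketch: the import path, and that the config is constructible with defaults, are both assumptions):

```python
from lerobot.configs.train import TrainPipelineConfig  # import path assumed


def test_dataloader_mp_context_defaults_to_spawn():
    cfg = TrainPipelineConfig()  # assumes defaults are constructible
    assert cfg.dataloader_multiprocessing_context == "spawn"
```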

Make the DataLoader multiprocessing start method configurable on
TrainPipelineConfig and default it to 'spawn'.

The previous default (fork on Linux) is unsafe with libraries that hold
non-fork-safe state in the parent process — common ones in this codebase
are PyAV, torchcodec, and the ffmpeg shared libs they wrap. Symptoms
reported in huggingface#2488, huggingface#2209, and observed locally include:

- multiprocessing.context.AuthenticationError: digest received was wrong
- RuntimeError: Pin memory thread exited unexpectedly
- RuntimeError: DataLoader worker exited unexpectedly
- Random SIGSEGV inside worker processes during video decode

Switching to spawn re-imports modules cleanly in each worker and
eliminates these failure modes. Added the setting as a config field
rather than hard-coding so users on platforms where fork is preferred
can opt back in via --dataloader-multiprocessing-context=fork.
github-actions bot added the configuration (Problems with configuration files or settings) label on May 6, 2026

Development

Successfully merging this pull request may close these issues.

Multiprocessing leads to decoding error
