Default DataLoader multiprocessing context to spawn#3520
Open
0o8o0-blip wants to merge 1 commit intohuggingface:mainfrom
Open
Default DataLoader multiprocessing context to spawn#35200o8o0-blip wants to merge 1 commit intohuggingface:mainfrom
0o8o0-blip wants to merge 1 commit intohuggingface:mainfrom
Conversation
Make the DataLoader multiprocessing start method configurable on TrainPipelineConfig and default it to 'spawn'. The previous default (fork on Linux) is unsafe with libraries that hold non-fork-safe state in the parent process — common ones in this codebase are PyAV, torchcodec, and the ffmpeg shared libs they wrap. Symptoms reported in huggingface#2488, huggingface#2209, and observed locally include: - multiprocessing.context.AuthenticationError: digest received was wrong - RuntimeError: Pin memory thread exited unexpectedly - RuntimeError: DataLoader worker exited unexpectedly - Random SIGSEGV inside worker processes during video decode Switching to spawn re-imports modules cleanly in each worker and eliminates these failure modes. Added the setting as a config field rather than hard-coding so users on platforms where fork is preferred can opt back in via --dataloader-multiprocessing-context=fork.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make the
DataLoadermultiprocessing start method configurable onTrainPipelineConfigand default it to\"spawn\". Previously the default was the platform default (forkon Linux), which is unsafe with libraries that hold non-fork-safe state in the parent — notably PyAV, torchcodec, and the ffmpeg shared libs they wrap.Closes #2488 and addresses one of the worker-failure modes in #2209.
What breaks today (fork default)
Reproducible on RTX 3060 + Ubuntu 22.04 + torch 2.10 + lerobot main, training ACT on a small video dataset with
--num_workers=2:multiprocessing.context.AuthenticationError: digest received was wrong— appears mid-epoch when a forked worker re-uses the parent's multiprocessing.authkey state in a way the parent has invalidatedRuntimeError: Pin memory thread exited unexpectedly— pin-memory thread shares fork-time CUDA state with the parent and crashes when reinitializingRuntimeError: DataLoader worker (pid X) exited unexpectedly— typical wrapper for either of the aboveEmpirically: 5/5 launches with the current default crash before the first checkpoint. After this change: 1/1 clean.
Why spawn is safe to default
--dataloader-multiprocessing-context=forkThe startup cost on small datasets is ~5–10 s of extra worker spinup once per run, paid back the first time you'd otherwise have hit one of the failure modes above.
Test plan