# Scheduler deadlocks after first batch of parallel Python assets — no error, no further scheduling
## Summary
`bruin run` reliably hangs after dispatching exactly one asset beyond the initial worker-pool fill, regardless of the `--workers` value. The scheduler logs `started the scheduler loop` once at startup, fills the initial worker slots, dispatches exactly one follow-up asset as slots free up, then emits no further output for the rest of the run.

The pattern is independent of concurrency level:
| `--workers` | Assets completed before hang |
| --- | --- |
| 16 (default) | 17 (first 16 + 1 follow-up) |
| 1 | 2 (first 1 + 1 follow-up) |
This is not a parallelism race — a single-worker run hangs after 2 assets. It looks like the completion handler processes exactly one completion event per scheduler lifetime and then stops listening.
All spawned Python subprocesses finish cleanly and exit; the parent `bruin` process stays alive at 0% CPU with no open network sockets and no child processes, but never schedules additional work. No error, exception, panic, or "task failed" event is logged. `Ctrl+C` does not interrupt it — the user has to `kill <pid>` externally.
## Environment
- Bruin version: `v0.11.544` (commit `51264459f35c3abbd3af804ef674371c60a5d349`) — current latest as of this report
- OS: macOS on Apple Silicon (Darwin 25.4.0, `arm64` / `aarch64-apple-darwin`)
- Longevity: also reproduced on the prior installed bruin version; upgrading to `v0.11.544` did not fix it. Not a fresh regression — this has been present for at least the last few versions the user has installed.
- Pipeline size: 449 assets (69 Python bronze + ~20 other bronze + ~300 SQL silver/gold/staging)
- Command: `bruin --debug run --verbose -x load -x expensive .`
- Destinations: BigQuery (SQL), GCS (parquet uploads from Python assets)
- Python asset workload: `pandas.read_sql` from SQL Server via `pymssql` → `save_to_parquet` to GCS → `create_external_table` in BigQuery
## Reproduction
- Pipeline with 449 assets, of which ~90 are Python type and the rest SQL views. Python assets call `pd.read_sql(...)`, upload parquet to GCS, and create BigQuery external tables. Each Python bronze asset is independent (no `depends:` between them; SQL silver/gold assets depend on them).
- Run: `bruin --debug run --verbose -x load -x expensive . 2>&1 | tee /tmp/bruin-debug.log`
- Observe: exactly 17 Python assets run (16 concurrent workers + 1 follow-up as a slot frees). All 17 emit `Finished: ...` log lines.
- After the last `Finished:` event, the log goes silent. The scheduler does not spawn any new Python assets, nor does it begin any of the eligible SQL view assets.
- Repeat with `bruin --debug run --verbose --workers 1 -t bronze -x expensive .` — same hang, after 2 completed assets.
- Process state after the hang (confirmed via `ps`, `lsof`, `pgrep`):
  - STAT: `S+` (sleeping in foreground)
  - %CPU: 0.0
  - Children: none
  - Open network sockets: none (only internal pipes)
- The hang is deterministic across runs.
## Debug log: the smoking gun
With `--workers 1 --debug`, the debug log shows exactly one `scheduler.go:722 received task result: X` event per scheduler lifetime:
```
20:46:36 DEBUG scheduler/scheduler.go:715 started the scheduler loop
20:46:36 DEBUG python/operator.go:97 running Python asset bronze.colleague.acad_program_reqmts ...
[20:47:03] Finished: bronze.colleague.acad_program_reqmts (26.114s)
20:47:03 DEBUG scheduler/scheduler.go:722 received task result: bronze.colleague.acad_program_reqmts   ← fires
20:47:03 DEBUG python/operator.go:97 running Python asset bronze.colleague.acad_program_reqmts_ls ...
[20:47:08] Finished: bronze.colleague.acad_program_reqmts_ls (5.262s)

<<< NO "received task result" for acad_program_reqmts_ls >>>
<<< scheduler silent for the rest of the run >>>
```
Task 1 completes → scheduler receives the result → dispatches task 2.
Task 2 completes → scheduler never receives the result → nothing further is dispatched.
The user-facing `Finished:` line appears (emitted by the operator itself when the subprocess exits), so we know the Python worker exited normally. But `scheduler.go:722` only fires once per scheduler lifetime — suggesting the result channel is drained once and then never re-read, or the receiver goroutine exits after one iteration.
With `--workers 16`, the same pattern holds: the initial 16 workers spawn simultaneously, one task's result is processed (triggering a 17th dispatch), and then subsequent completions are silently dropped.
## Expected behavior
After a Python asset finishes, the scheduler should (a minimal sketch of this cycle follows the list):
- Mark the task complete in the task graph
- Re-evaluate the DAG for newly-eligible assets
- Dispatch eligible work to freed worker slots
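For concreteness, here is a minimal Go sketch of that cycle (Bruin's scheduler is Go, per the `scheduler.go` paths above). This is purely illustrative: `taskResult`, `schedulerLoop`, and the graph steps in the comments are made-up stand-ins, not Bruin's actual API.

```go
package main

import "fmt"

type taskResult struct{ id string }

// schedulerLoop sketches the expected cycle: receive a completion, mark it
// in the task graph, re-evaluate the DAG, dispatch newly-eligible work,
// and repeat until every task has reported back. Names are illustrative,
// not Bruin's actual code.
func schedulerLoop(results <-chan taskResult, total int) {
	completed := 0
	for res := range results { // re-arms the receive on every iteration
		completed++
		fmt.Println("received task result:", res.id) // cf. scheduler.go:722
		// 1. mark res.id complete in the task graph
		// 2. re-evaluate the DAG for newly-eligible assets
		// 3. dispatch them to the freed worker slot
		if completed == total {
			return // exit only once all tasks have completed
		}
	}
}

func main() {
	results := make(chan taskResult)
	go func() {
		for _, id := range []string{"a", "b", "c"} {
			results <- taskResult{id: id}
		}
	}()
	schedulerLoop(results, 3) // prints "received task result" three times
}
```

The key property is that the receive lives inside a loop for the whole scheduler lifetime; in the observed behavior below, that loop appears to run exactly once.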
## Observed behavior
Worker slots free up (confirmed — 17 Python processes exit cleanly with complete GCS parquet files and BigQuery external tables) but the scheduler never dispatches further work. Process stays alive at 0% CPU indefinitely.
## Workarounds tried
- Default `--workers 16`: hangs after 17 completions.
- `--workers 1`: hangs after 2 completions.
- `bruin clean` (wipes uv caches and logs) + `rm -rf __pycache__`: no effect — same hang, same debug signature.
- Bypassing the scheduler entirely: running `bruin run <single-asset>.py` in a shell loop works fine (each invocation is a fresh `bruin` process with its own scheduler, which happily runs its one asset and exits).
## Hypothesis
The debug log at `scheduler.go:722` confirms a one-shot completion receiver: the log line `received task result: X` appears exactly once per run, regardless of how many tasks actually finish. Task 1's completion is received and acted on (triggering the next dispatch); task 2's completion is silently dropped.
This is consistent with any of the following (sketched after this list):

- A result channel being drained via a single `<-ch` rather than a `for` loop
- A `select` statement that is never re-armed (no enclosing loop)
- A receiver goroutine that `return`s after the first iteration
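For illustration, a minimal sketch of what the first two patterns look like in Go. This is hypothetical code sketching the suspected class of bug, not Bruin's actual scheduler; `taskResult`, `process`, and the handler names are made up.

```go
package main

type taskResult struct{ id string }

func process(taskResult) { /* mark complete, re-evaluate DAG, dispatch */ }

// Pattern 1: a single receive instead of a loop. The channel is read
// exactly once; later completions are never consumed, so a worker's send
// blocks (unbuffered) or sits unread (buffered) and nothing more is
// ever dispatched.
func handleOnce(results <-chan taskResult) {
	process(<-results)
}

// Pattern 2: a select with no enclosing for-loop. One case fires, the
// function falls through and returns, and the receive is never re-armed.
// Pattern 3 (a receiver goroutine that returns after its first iteration)
// looks identical from the outside.
func handleSelectOnce(results <-chan taskResult, stop <-chan struct{}) {
	select {
	case res := <-results:
		process(res)
	case <-stop:
	}
	// the fix would be to wrap the select in `for { ... }` and return
	// only when stop (or a done condition) fires
}

func main() {
	results := make(chan taskResult, 2)
	results <- taskResult{id: "task-1"}
	results <- taskResult{id: "task-2"}
	handleOnce(results) // consumes task-1; task-2 is stranded unread
}
```

In every variant the externally visible behavior matches the report: the first completion is processed and triggers one more dispatch, later completions are never consumed, and the process parks at 0% CPU with nothing left to do.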
The `--workers 1` case rules out any worker-pool race, concurrency contention, or channel-capacity issue. This is a deterministic state-machine/control-flow bug in the scheduler's result-handling code path.
Happy to provide a minimal repro, additional debug output, goroutine dumps (the Go runtime dumps all goroutine stacks on `SIGQUIT`, i.e. `kill -QUIT <pid>` against the hung process), or pprof profiles on request.