Scheduler completion receiver fires only once (scheduler.go:722) — pipeline hangs after 1 follow-up dispatch (macOS arm64) #1952

@agcrum

Description

Scheduler deadlocks after first batch of parallel Python assets — no error, no further scheduling

Summary

bruin run reliably hangs after dispatching exactly one asset beyond the initial worker-pool fill, regardless of the --workers value. The scheduler logs "started the scheduler loop" once at startup, fills the initial worker slots, dispatches exactly one follow-up asset as slots free up, and then never emits any further output.

The pattern is independent of concurrency level:

| --workers | Assets completed before hang |
| --- | --- |
| 16 (default) | 17 (first 16 + 1 follow-up) |
| 1 | 2 (first 1 + 1 follow-up) |

This is not a parallelism race — a single-worker run hangs after 2 assets. It looks like the completion handler processes exactly one completion event per scheduler lifetime and then stops listening.

All spawned Python subprocesses finish cleanly and exit; the parent bruin process stays alive at 0% CPU with no open network sockets and no child processes, but never schedules additional work. No error, exception, panic, or "task failed" event is logged. Ctrl+C does not interrupt it — I have to kill <pid> externally.

Environment

  • Bruin version: v0.11.544 (commit 51264459f35c3abbd3af804ef674371c60a5d349) — current latest as of this report
  • OS: macOS on Apple Silicon (Darwin 25.4.0, arm64 / aarch64-apple-darwin)
  • History: also reproduced on the previously installed bruin version; upgrading to v0.11.544 did not fix it. This is not a fresh regression — it has been present for at least the last few versions I have installed.
  • Pipeline size: 449 assets (69 Python bronze + ~20 other bronze + ~300 SQL silver/gold/staging)
  • Command: bruin --debug run --verbose -x load -x expensive .
  • Destinations: BigQuery (SQL), GCS (parquet uploads from Python assets)
  • Python asset workload: pandas.read_sql from SQL Server via pymssql → save_to_parquet to GCS → create_external_table in BigQuery

Reproduction

  1. Pipeline with 449 assets, of which ~90 are Python type and the rest SQL views. Python assets call pd.read_sql(...), upload parquet to GCS, and create BigQuery external tables. Each Python bronze asset is independent (no depends: between them; SQL silver/gold assets depend on them).
  2. Run: bruin --debug run --verbose -x load -x expensive . 2>&1 | tee /tmp/bruin-debug.log
  3. Observe: exactly 17 Python assets run (16 concurrent workers + 1 follow-up as a slot frees). All 17 emit Finished: ... log lines.
  4. After the last Finished: event, the log goes silent. The scheduler does not spawn any new Python assets, nor does it begin any of the eligible SQL view assets.
  5. Repeat with bruin --debug run --verbose --workers 1 -t bronze -x expensive . — same hang, after 2 completed assets.
  6. Process state after the hang (confirmed via ps, lsof, pgrep):
    • STAT: S+ (sleeping in foreground)
    • %CPU: 0.0
    • Children: none
    • Open network sockets: none (only internal pipes)
  7. The hang is deterministic across runs.

Debug log: the smoking gun

With --workers 1 --debug, the debug log shows exactly one "received task result: X" event from scheduler/scheduler.go:722 per scheduler lifetime:

```
20:46:36  DEBUG  scheduler/scheduler.go:715  started the scheduler loop
20:46:36  DEBUG  python/operator.go:97  running Python asset bronze.colleague.acad_program_reqmts ...
[20:47:03] Finished: bronze.colleague.acad_program_reqmts (26.114s)
20:47:03  DEBUG  scheduler/scheduler.go:722  received task result: bronze.colleague.acad_program_reqmts   ← fires
20:47:03  DEBUG  python/operator.go:97  running Python asset bronze.colleague.acad_program_reqmts_ls ...
[20:47:08] Finished: bronze.colleague.acad_program_reqmts_ls (5.262s)
<<< NO "received task result" for acad_program_reqmts_ls >>>
<<< scheduler silent for the rest of the run >>>
```

Task 1 completes → scheduler receives the result → dispatches task 2.
Task 2 completes → scheduler never receives the result → nothing further is dispatched.

The user-facing Finished: line appears (emitted by the operator itself when the subprocess exits), so the Python worker exited normally. But scheduler.go:722 fires only once per scheduler lifetime — suggesting either that the result channel is read exactly once and never re-read, or that the receiver goroutine returns after its first iteration.

With --workers 16, the same pattern holds: the initial 16 workers spawn simultaneously, one task's result is processed (triggering a 17th dispatch), and then subsequent completions are silently dropped.

Expected behavior

After a Python asset finishes, the scheduler should (see the Go sketch after this list):

  1. Mark the task complete in the task graph
  2. Re-evaluate the DAG for newly-eligible assets
  3. Dispatch eligible work to freed worker slots
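
In Go terms, the shape I'd expect is a long-lived receive loop along these lines. This is a hedged sketch only — I have not read bruin's scheduler internals, and every name here (TaskResult, markComplete, dispatchEligible) is invented for illustration:

```go
package scheduler

// All types and names below are hypothetical, invented to illustrate
// the expected control flow — this is not bruin's actual code.
type TaskResult struct{ AssetID string }

type Scheduler struct {
	results <-chan TaskResult // workers send completions here
	pending int               // tasks not yet finished
}

// run is the long-lived completion receiver: it must re-arm on every
// result, not read the channel just once.
func (s *Scheduler) run() {
	for res := range s.results { // loops until the channel is closed
		s.markComplete(res)  // 1. mark the task complete in the task graph
		s.dispatchEligible() // 2–3. re-evaluate the DAG, fill freed slots
		if s.pending == 0 {
			return
		}
	}
}

func (s *Scheduler) markComplete(r TaskResult) { s.pending-- }
func (s *Scheduler) dispatchEligible()         { /* elided for brevity */ }
```

The key property is that the receive sits inside a loop, so every completion re-triggers steps 1–3.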

Observed behavior

Worker slots free up (confirmed — 17 Python processes exit cleanly with complete GCS parquet files and BigQuery external tables) but the scheduler never dispatches further work. Process stays alive at 0% CPU indefinitely.

Workarounds tried

  • Default --workers 16: hangs after 17 completions.
  • --workers 1: hangs after 2 completions.
  • bruin clean (wipes uv caches and logs) + rm -rf __pycache__: no effect — same hang, same debug signature.
  • Bypassing the scheduler entirely: running bruin run <single-asset>.py in a shell loop works fine (each invocation is a fresh bruin process with its own scheduler, each of which happily runs its one asset and exits).

Hypothesis

The debug log at scheduler.go:722 confirms a one-shot completion receiver: the log line received task result: X appears exactly once per run, regardless of how many tasks actually finish. Task 1's completion is received and acted on (triggering the next dispatch); task 2's completion is silently dropped.

This is consistent with any of the following (minimal Go sketch after the list):

  • A result channel being drained via a single <-ch rather than a loop
  • A select statement missing a re-arm
  • A receiver goroutine that returns after the first iteration
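
All three reduce to the same wrong shape. For illustration only — hypothetical code reusing the invented Scheduler type from the sketch above, not the actual contents of scheduler.go — the first variant looks like:

```go
// BUGGY (hypothetical): a one-shot receive where a loop is needed.
// Exactly one completion is ever processed; every later send from a
// worker blocks (unbuffered channel) or is dropped on the floor
// (buffered channel), and the scheduler goes silent — matching the
// debug log above.
func (s *Scheduler) run() {
	res := <-s.results   // fires exactly once per scheduler lifetime
	s.markComplete(res)  // task 1 marked complete...
	s.dispatchEligible() // ...one follow-up dispatched...
	// ...and run returns: nothing ever reads s.results again.
}
```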

The --workers 1 case rules out any worker-pool race, concurrency contention, or channel-capacity issue. This is a deterministic state-machine/control-flow bug in the scheduler's result-handling code path.
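
To make the N+1 arithmetic concrete, here is a self-contained toy model — pure illustration, no bruin code — in which a one-shot receiver yields exactly workers+1 Finished: lines but a single received task result: line, reproducing the 17-then-silence signature (a sleep stands in for killing the hung process):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const workers = 16 // with workers = 1 you get exactly 2 completions
	results := make(chan int)

	worker := func(id int) {
		time.Sleep(10 * time.Millisecond) // stand-in for running an asset
		fmt.Printf("Finished: task %d\n", id)
		results <- id // blocks until the scheduler reads it — or forever
	}

	// Initial worker-pool fill.
	for i := 0; i < workers; i++ {
		go worker(i)
	}

	// BUGGY one-shot receiver: reads one result, dispatches one
	// follow-up task, then never reads the channel again.
	go func() {
		id := <-results
		fmt.Printf("received task result: %d\n", id)
		go worker(workers) // the single follow-up dispatch
	}()

	// In the real bug the process hangs here forever at 0% CPU;
	// the sleep lets the toy print its evidence and exit.
	time.Sleep(300 * time.Millisecond)
	fmt.Println("scheduler silent: every task printed Finished, only one result was received")
}
```

Run as-is it prints workers+1 Finished: lines and exactly one received task result: line; change workers to 1 and you get the 2-completion variant from the table above.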

Happy to provide a minimal repro, additional debug output, goroutine dumps (SIGQUIT), or pprof profiles on request.
