Scheduler completion receiver fires only once (scheduler.go:722) — pipeline hangs after 1 follow-up dispatch (macOS arm64) #1952

@agcrum

Description

Scheduler deadlocks after first batch of parallel Python assets — no error, no further scheduling

Summary

bruin run reliably hangs after dispatching exactly one asset beyond the initial worker-pool fill, regardless of the --workers value. The scheduler logs "started the scheduler loop" once at startup, fills the initial worker slots, dispatches exactly one follow-up asset as slots free up, and then never emits any further output.

The pattern is independent of concurrency level:

| --workers | Assets completed before hang |
| --- | --- |
| 16 (default) | 17 (first 16 + 1 follow-up) |
| 1 | 2 (first 1 + 1 follow-up) |

This is not a parallelism race — a single-worker run hangs after 2 assets. It looks like the completion handler processes exactly one completion event per scheduler lifetime and then stops listening.

All spawned Python subprocesses finish cleanly and exit; the parent bruin process stays alive at 0% CPU with no open network sockets and no child processes, but never schedules additional work. No error, exception, panic, or "task failed" event is logged. Ctrl+C does not interrupt it — I have to kill <pid> externally.

Environment

  • Bruin version: v0.11.544 (commit 51264459f35c3abbd3af804ef674371c60a5d349) — current latest as of this report
  • OS: macOS on Apple Silicon (Darwin 25.4.0, arm64 / aarch64-apple-darwin)
  • History: also reproduced on the previously installed bruin version; upgrading to v0.11.544 did not fix it. This is not a fresh regression — it has been present for at least the last few versions I have installed.
  • Pipeline size: 449 assets (69 Python bronze + ~20 other bronze + ~300 SQL silver/gold/staging)
  • Command: bruin --debug run --verbose -x load -x expensive .
  • Destinations: BigQuery (SQL), GCS (parquet uploads from Python assets)
  • Python asset workload: pandas.read_sql from SQL Server via pymssql → save_to_parquet to GCS → create_external_table in BigQuery

Reproduction

  1. Pipeline with 449 assets, of which ~90 are Python type and the rest SQL views. Python assets call pd.read_sql(...), upload parquet to GCS, and create BigQuery external tables. Each Python bronze asset is independent (no depends: between them; SQL silver/gold assets depend on them).
  2. Run: bruin --debug run --verbose -x load -x expensive . 2>&1 | tee /tmp/bruin-debug.log
  3. Observe: exactly 17 Python assets run (16 concurrent workers + 1 follow-up as a slot frees). All 17 emit Finished: ... log lines.
  4. After the last Finished: event, the log goes silent. The scheduler does not spawn any new Python assets, nor does it begin any of the eligible SQL view assets.
  5. Repeat with bruin --debug run --verbose --workers 1 -t bronze -x expensive . — same hang, after 2 completed assets.
  6. Process state after the hang (confirmed via ps, lsof, pgrep):
    • STAT: S+ (sleeping in foreground)
    • %CPU: 0.0
    • Children: none
    • Open network sockets: none (only internal pipes)
  7. The hang is deterministic across runs.

Debug log: the smoking gun

With --workers 1 --debug, the debug log shows exactly one "received task result: X" event from scheduler/scheduler.go:722 per scheduler lifetime:

```
20:46:36  DEBUG  scheduler/scheduler.go:715  started the scheduler loop
20:46:36  DEBUG  python/operator.go:97  running Python asset bronze.colleague.acad_program_reqmts ...
[20:47:03] Finished: bronze.colleague.acad_program_reqmts (26.114s)
20:47:03  DEBUG  scheduler/scheduler.go:722  received task result: bronze.colleague.acad_program_reqmts   ← fires
20:47:03  DEBUG  python/operator.go:97  running Python asset bronze.colleague.acad_program_reqmts_ls ...
[20:47:08] Finished: bronze.colleague.acad_program_reqmts_ls (5.262s)
<<< NO "received task result" for acad_program_reqmts_ls >>>
<<< scheduler silent for the rest of the run >>>
```

Task 1 completes → scheduler receives the result → dispatches task 2.
Task 2 completes → scheduler never receives the result → nothing further is dispatched.

The user-facing Finished: line appears (emitted by the operator itself when the subprocess exits), so the Python worker exited normally. But scheduler.go:722 fires only once per scheduler lifetime — suggesting either that the result channel is read exactly once and never re-read, or that the receiver goroutine returns after its first iteration.

With --workers 16, the same pattern holds: the initial 16 workers spawn simultaneously, one task's result is processed (triggering a 17th dispatch), and then subsequent completions are silently dropped.

Expected behavior

After a Python asset finishes, the scheduler should (see the Go sketch after this list):

  1. Mark the task complete in the task graph
  2. Re-evaluate the DAG for newly-eligible assets
  3. Dispatch eligible work to freed worker slots
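
In Go terms, the shape I'd expect is a long-lived receive loop along these lines. This is a hedged sketch only — I have not read bruin's scheduler internals, and every name here (TaskResult, markComplete, dispatchEligible) is invented for illustration:

```go
package scheduler

// All types and names below are hypothetical, invented to illustrate
// the expected control flow — this is not bruin's actual code.
type TaskResult struct{ AssetID string }

type Scheduler struct {
	results <-chan TaskResult // workers send completions here
	pending int               // tasks not yet finished
}

// run is the long-lived completion receiver: it must re-arm on every
// result, not read the channel just once.
func (s *Scheduler) run() {
	for res := range s.results { // loops until the channel is closed
		s.markComplete(res)  // 1. mark the task complete in the task graph
		s.dispatchEligible() // 2–3. re-evaluate the DAG, fill freed slots
		if s.pending == 0 {
			return
		}
	}
}

func (s *Scheduler) markComplete(r TaskResult) { s.pending-- }
func (s *Scheduler) dispatchEligible()         { /* elided for brevity */ }
```

The key property is that the receive sits inside a loop, so every completion re-triggers steps 1–3.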

Observed behavior

Worker slots free up (confirmed — 17 Python processes exit cleanly with complete GCS parquet files and BigQuery external tables) but the scheduler never dispatches further work. Process stays alive at 0% CPU indefinitely.

Workarounds tried

  • Default --workers 16: hangs after 17 completions.
  • --workers 1: hangs after 2 completions.
  • bruin clean (wipes uv caches and logs) + rm -rf __pycache__: no effect — same hang, same debug signature.
  • Bypassing the scheduler entirely: running bruin run <single-asset>.py in a shell loop works fine (each invocation is a fresh bruin process with its own scheduler, each of which happily runs its one asset and exits).

Hypothesis

The debug log at scheduler.go:722 confirms a one-shot completion receiver: the log line received task result: X appears exactly once per run, regardless of how many tasks actually finish. Task 1's completion is received and acted on (triggering the next dispatch); task 2's completion is silently dropped.

This is consistent with any of the following (minimal Go sketch after the list):

  • A result channel being drained via a single <-ch rather than a loop
  • A select statement missing a re-arm
  • A receiver goroutine that returns after the first iteration
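
All three reduce to the same wrong shape. For illustration only — hypothetical code reusing the invented Scheduler type from the sketch above, not the actual contents of scheduler.go — the first variant looks like:

```go
// BUGGY (hypothetical): a one-shot receive where a loop is needed.
// Exactly one completion is ever processed; every later send from a
// worker blocks (unbuffered channel) or is dropped on the floor
// (buffered channel), and the scheduler goes silent — matching the
// debug log above.
func (s *Scheduler) run() {
	res := <-s.results   // fires exactly once per scheduler lifetime
	s.markComplete(res)  // task 1 marked complete...
	s.dispatchEligible() // ...one follow-up dispatched...
	// ...and run returns: nothing ever reads s.results again.
}
```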

The --workers 1 case rules out any worker-pool race, concurrency contention, or channel-capacity issue. This is a deterministic state-machine/control-flow bug in the scheduler's result-handling code path.
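
To make the N+1 arithmetic concrete, here is a self-contained toy model — pure illustration, no bruin code — in which a one-shot receiver yields exactly workers+1 Finished: lines but a single received task result: line, reproducing the 17-then-silence signature (a sleep stands in for killing the hung process):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const workers = 16 // with workers = 1 you get exactly 2 completions
	results := make(chan int)

	worker := func(id int) {
		time.Sleep(10 * time.Millisecond) // stand-in for running an asset
		fmt.Printf("Finished: task %d\n", id)
		results <- id // blocks until the scheduler reads it — or forever
	}

	// Initial worker-pool fill.
	for i := 0; i < workers; i++ {
		go worker(i)
	}

	// BUGGY one-shot receiver: reads one result, dispatches one
	// follow-up task, then never reads the channel again.
	go func() {
		id := <-results
		fmt.Printf("received task result: %d\n", id)
		go worker(workers) // the single follow-up dispatch
	}()

	// In the real bug the process hangs here forever at 0% CPU;
	// the sleep lets the toy print its evidence and exit.
	time.Sleep(300 * time.Millisecond)
	fmt.Println("scheduler silent: every task printed Finished, only one result was received")
}
```

Run as-is it prints workers+1 Finished: lines and exactly one received task result: line; change workers to 1 and you get the 2-completion variant from the table above.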

Happy to provide a minimal repro, additional debug output, goroutine dumps (SIGQUIT), or pprof profiles on request.
