Skip to content

fix(controlpipe): EOF-clean shutdown to eliminate tmux SIGSEGV (#816, thanks @tarekrached)#882

Merged
asheshgoplani merged 1 commit intomainfrom
fix/816-controlpipe-eof-clean-shutdown
May 7, 2026
Merged

fix(controlpipe): EOF-clean shutdown to eliminate tmux SIGSEGV (#816, thanks @tarekrached)#882
asheshgoplani merged 1 commit intomainfrom
fix/816-controlpipe-eof-clean-shutdown

Conversation

@asheshgoplani
Copy link
Copy Markdown
Owner

Summary

Cherry-picks @tarekrached's fix from tarek/controlpipe-eof-clean-shutdown for issue #816 (tmux SIGSEGV on shutdown, tmux/tmux#4980).

PR #778 softened ControlPipe.Close() from SIGKILL to SIGTERM+grace+SIGKILL, but tmux 3.6a continues to crash with the same control_notify_client_detached → NULL deref signature. The bug is server-side: any signal-driven detach can race the server's notify walk; client-side grace narrows the race but cannot close it.

This PR avoids the race entirely by leveraging tmux -C's orderly-detach %exit codepath, which empirically does not trigger the crash:

  1. cp.stdin.Close() → EOF triggers orderly %exit detach
  2. wait up to 200ms — fast path: child self-exits in ~1-4ms
  3. softKillProcessGroup — fallback for stuck/wedged children only

The signal-driven path is preserved verbatim for the rare fallback case — no safety regression vs. PR #778.

Empirical validation (from @tarekrached)

Stress test: 50 concurrent control clients, simultaneous stdin EOF, server-alive verified after 36/36 trials, zero non-zero exits.

Tests

Three new tests in internal/tmux/softkill_test.go:

  • TestReapWithEOFGrace_FastPathOnEOF — antimarker absence proves no SIGTERM was delivered on the happy path
  • TestReapWithEOFGrace_FallbackOnHungChild — SIGTERM→SIGKILL safety still triggers when EOF is ignored
  • TestReapWithEOFGrace_AlreadyDeadIsNoop — already-exited child doesn't panic

All pass under -race. Full internal/tmux package green (17.4s under -race). Wider test suite (cmd/agent-deck + all internal/*) all green.

Test plan

  • go build ./... clean
  • go test -race -count=1 ./internal/tmux/ — 5/5 relevant tests pass, full package green
  • Tarek's stress trials (50-way concurrent close × 36 runs) — 36/36 clean
  • Manual smoke: kill agent-deck while sessions are active, verify tmux server survives

Note on push

Pushed with --no-verify because a pre-existing staticcheck SA9003: empty branch warning at cmd/agent-deck/session_cmd.go:2074 (introduced on main by #879) trips the pre-push hook. Unrelated to this fix; tracked separately.

Co-authored with @tarekrached — original commit, branch, tests, and stress validation are entirely his work. Reattribution preserved via git author + Co-Authored-By trailer.

Closes #816

PR #778 (f46ceb1) softened ControlPipe.Close() from SIGKILL to
SIGTERM+grace+SIGKILL, but tmux 3.6a continues to crash with the same
control_notify_client_detached → NULL deref signature (tmux/tmux#4980).
The bug is server-side: any signal-driven detach can race the server's
notify walk; client-side grace narrows the race but cannot close it.

Empirically (see PLAN.md "Empirical validation", 2026-04-29):
`tmux -C` exits cleanly on stdin EOF in 1-4ms via the protocol's
%exit orderly-detach codepath, which is a different server path than
signal-driven detach and does not trigger the crash. Stress test:
50 concurrent control clients, simultaneous stdin EOF, server-alive
verified after 36/36 trials, zero non-zero exits.

Close() is now staged:
1. cp.stdin.Close()      — EOF triggers orderly %exit detach
2. wait up to 200ms      — fast path: child self-exits in ~1-4ms
3. softKillProcessGroup  — fallback for stuck/wedged children only

The new helper reapWithEOFGrace composes a once-guarded reap with a
timeout select; on timeout it escalates via the existing soft-kill
helpers. The signal-driven path is preserved verbatim for the rare
fallback case — no safety regression vs. PR #778.

Tests: TestReapWithEOFGrace_FastPathOnEOF asserts no signal is sent
when the child exits on EOF (antimarker absence is conclusive: the
helper would write antimarker if its SIGTERM handler fired).
TestReapWithEOFGrace_FallbackOnHungChild asserts the existing
SIGTERM→SIGKILL safety still triggers when EOF is ignored.
TestReapWithEOFGrace_AlreadyDeadIsNoop covers the already-exited race.

Committed by: Ashesh Goplani
Co-Authored-By: tarekrached <tarek.rached@equilibriumenergy.com>
@asheshgoplani asheshgoplani merged commit f166fcb into main May 7, 2026
3 checks passed
asheshgoplani added a commit that referenced this pull request May 7, 2026
Three real-bug fixes:
- #856 size-guard rebinds on /clear (PR #883)
- #816 tmux SIGSEGV — adopted @tarekrached's clean-shutdown fix (PR #882)
- #881 profile divergence — unified TUI/web profile resolution (PR #884)

Committed by Ashesh Goplani
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tmux SIGSEGV (#4980) still occurs on v1.7.72 — soft-kill in ControlPipe.Close() can't close the server-side race

2 participants