fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss by asheshgoplani · Pull Request #664 · asheshgoplani/agent-deck

asheshgoplani · 2026-04-18T15:34:34Z

Summary

Closes #661. Adds a size-aware rebind guard to Instance.UpdateHookStatus's Claude path so a fresh 1-record jsonl cannot overwrite a rich historic jsonl on the UserPromptSubmit flap. This is the missing counterpart to v1.7.7, which fixed the tmux_env rebind path but left the hook_payload path with a weaker binary has-any-data check. Filed #662 for the structural follow-up in buildClaudeResumeCommand (v1.7.24 scope).

Root cause (manifestation 3)

Travel conductor repro (3× in 7 hours):

bind    hook_payload  new=dd17cb25   SessionStart           (original, 974KB jsonl)
rebind  hook_payload  50fe72cc→dd17cb25  SessionEnd
reject  hook_payload  dd17cb25  reason=candidate_has_no_conversation_data
rebind  hook_payload  dd17cb25→50fe72cc  UserPromptSubmit   ← THE BUG

By the time UserPromptSubmit arrives, the fresh jsonl holds exactly 1 record and passes sessionHasConversationData. v1.7.7's binary check let the rebind through, silently replacing the rich 974KB jsonl with the fresh 1-record one. On the next restart, buildClaudeResumeCommand resumed the fresh ID and history evaporated.

Fix

New helper sessionConversationByteSize(inst, sessionID) int64 returns jsonl file size in bytes. UpdateHookStatus now decides:

Cold start (i.ClaudeSessionID == "") → accept (first bind).
Candidate has zero sessionId records → reject candidate_has_no_conversation_data (v1.7.7, preserved).
Both sides have data → candidate must be strictly larger in bytes. Otherwise reject candidate_has_less_conversation_data.

Byte size is a robust "how much history this session holds" proxy — immune to record-count ties and faster than rescanning.

The tmux_env rebind path (UpdateClaudeSession, line 2750-2781) already had a stronger guard; this brings the hook_payload path to parity. The successful-bind bookkeeping is extracted into bindClaudeSessionFromHook so cold-start and rebind paths share one implementation.

Tests

Six new tests in internal/session/instance_test.go:

TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasMoreData — the Conductor history loss — UserPromptSubmit rebind path lacks 'current_has_more_data' guard (manifestation 3) #661 pin.
TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasDataCandidateEmpty — v1.7.7 path preserved.
TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsEmpty — healthy first-prompt rebind.
TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsUnset — cold start.
TestInstance_UpdateHookStatus_RejectsBidirectionalFlap — 3-cycle travel-conductor replay.
TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID — live-runtime boundary: the emitted restart command must contain --resume <rich-uuid>.

Six matching regression entries appended to .claude/release-tests.yaml.

Telegram-direction non-event (three reverted commits in history)

Three commits (43953e9, b1b53f5, 24bb286) reversed the v1.7.22 telegram topology validator based on a wrong interim reading. Codex-reviewed and empirically verified in the same window: v1.7.22's original direction (enabledPlugins.telegram@...=false globally + per-conductor --channels) is correct. 53526bf reverts those three; the v1.7.22 validator, tests, and docs are unchanged from main. The revert is preserved in history as an honest record.

Test plan

go test ./... -race -count=1 -timeout 900s — green across 30+ packages
5× consecutive TestInstance_UpdateHookStatus_RejectsBidirectionalFlap + TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID with -race — all green
TestPersistence_* (mandatory session-lifecycle gate) — green
Telegram validator suite — green (v1.7.22 state preserved)
GH Actions on this PR — pending

Committed by Ashesh Goplani

…history loss Adds 5 new UpdateHookStatus tests pinning the manifestation-3 flap: UserPromptSubmit arrives with a fresh 1-record jsonl ID and silently overwrites the rich hundreds-of-KB historic jsonl ID on every restart. Tests: - RejectsRebindWhenCurrentHasMoreData — pins size-based rejection - RejectsRebindWhenCurrentHasDataCandidateEmpty — preserves v1.7.7 guard - AllowsRebindWhenCurrentIsEmpty — healthy first-prompt rebind - AllowsRebindWhenCurrentIsUnset — cold start - RejectsBidirectionalFlap — replays 3× travel conductor flap from lifecycle log All 5 are RED against v1.7.22. GREEN arrives in the follow-up impl commit. Refs #661 Committed by Ashesh Goplani

…ry loss Adds a third decision branch to Instance.UpdateHookStatus's claude path: when BOTH current and candidate session IDs have conversation data, the candidate must have STRICTLY MORE bytes on disk to win the rebind. This plugs manifestation 3 of the conductor history-loss flap where: 1. Fresh claude boots → SessionStart with empty jsonl → v1.7.7 rejects ✓ 2. User types first prompt → UserPromptSubmit arrives with the candidate jsonl now holding exactly 1 record 3. v1.7.7's binary has-any-data check passes → rebind executes and silently overwrites the rich 974KB historic jsonl Byte size is the robust "how much history this session holds" proxy — immune to record-count ties and faster than rescanning the file. The tmux_env rebind path (UpdateClaudeSession, line 2750-2781) already had a stronger guard; this brings the hook_payload path to parity. New helper: sessionConversationByteSize(inst, sessionID) int64 — reuses the per-instance config-dir lookup + findSessionFileInAllProjects fallback used by sessionHasConversationData. Refactor: the successful-bind bookkeeping (lifecycle event, field assignments, tmux env propagation) extracted into bindClaudeSessionFromHook so cold-start and rebind paths share one implementation instead of copy-pasting. Tests: all 5 new TestInstance_UpdateHookStatus_* tests pass with -race, plus TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID pins the live boundary (the next restart's --resume argument must be the rich ID, not the overwritten fresh one). Refs #661 Committed by Ashesh Goplani

…erted Production verification showed the v1.7.22 model was wrong. --channels only SUBSCRIBES a session to a channel; it does not force-load the plugin. So: - enabledPlugins.telegram@...=true is REQUIRED, not an anti-pattern — it's how the plugin actually loads into the claude process. - Global enablement + --channels does NOT double-load; the plugin loads once, and --channels wires the subscription for whichever session owns the channel. Other sessions simply never subscribe. - The real misconfiguration is the opposite: global DISABLED + --channels on a conductor. The plugin never loads, --channels is a no-op, and the conductor has no telegram bridge at all (silent outage). New test suite expresses the corrected semantics: - GlobalEnabledWithChannels_NoWarning (canonical topology) - GlobalEnabled_NoChannels_NoWarning (non-subscribing sessions fine) - GlobalDisabled_NoChannels_NoWarning (nothing to warn about) - GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin (THE new warning) - GlobalDisabled_WithNonTelegramChannels_NoWarning (prefix-match guard) - WrapperStateDir_AntiPattern (WRAPPER_DEPRECATED unchanged) - WrapperStateDir_NoTelegramChannel_NoWarning (unchanged) - v1_7_22_CodesRemoved (permanence guard for GLOBAL_ANTIPATTERN/DOUBLE_LOAD) CLI tests flipped to match: TestConductorSetup_WarnsOnChannelsWithoutGlobalPlugin replaces TestConductorSetup_WarnsOnGlobalTelegramEnabled, and the "clean/silent" case is now the canonical topology (global=true + --channels) rather than its opposite. All 5 new assertions are RED against the v1.7.22 validator. GREEN arrives in the follow-up impl commit that rewrites the validator logic. Refs #661 #658 Committed by Ashesh Goplani

…logy Removes GLOBAL_ANTIPATTERN and DOUBLE_LOAD codes. Adds CHANNELS_WITHOUT_GLOBAL_PLUGIN for the real misconfiguration: conductor sessions that subscribe via --channels while the plugin is globally disabled, so the plugin never loads and the subscription silently no-ops. WRAPPER_DEPRECATED is unchanged — env_file is still the canonical injection mechanism. Validator surface: Canonical topology (global=true + --channels) → 0 warnings Global-only, non-subscribing session → 0 warnings Channels without global plugin (silent bridge outage) → CHANNELS_WITHOUT_GLOBAL_PLUGIN Wrapper-based TELEGRAM_STATE_DIR with telegram channel → WRAPPER_DEPRECATED Child-env-leak (#658) remains a real problem and its fix is unchanged — this commit only corrects the user-facing warning signal that was telling people to misconfigure themselves out of having a bridge. Refs #661 #658 Committed by Ashesh Goplani

…t entries Three coordinated doc surfaces updated to match the verified topology: - skills/agent-deck/SKILL.md: the "Telegram conductor topology" section now says enable the plugin globally + subscribe per-conductor via --channels + inject TELEGRAM_STATE_DIR via env_file. The anti-patterns table replaces GLOBAL_ANTIPATTERN / DOUBLE_LOAD with the single new code CHANNELS_WITHOUT_GLOBAL_PLUGIN that warns on the real misconfiguration. Adds a "Topology note" explaining why v1.7.22 was reversed so older readers who saw the old guidance don't re-regress. Removes the stale "Many competing telegram pollers" section whose advice to disable globally was the exact cause of the travel-conductor silent-bridge outage. - conductor/conductor-claude.md: runbook preflight + debug checklist now say enable the plugin globally, look for CHANNELS_WITHOUT_GLOBAL_PLUGIN or WRAPPER_DEPRECATED warnings (not the removed v1.7.22 codes), and calls out the topology-note-why-reversed explicitly so any conductor still reading v1.7.22 runbooks is redirected. - .claude/release-tests.yaml: drops the six v1722-telegram-* manifest entries (they pinned the wrong warning direction) and appends 12 new v1723-* entries — 6 for issue #661 history-loss guard (RejectsRebindWhenCurrentHasMoreData, RejectsRebindWhenCurrentHasDataCandidateEmpty, AllowsRebindWhenCurrentIsEmpty, AllowsRebindWhenCurrentIsUnset, RejectsBidirectionalFlap, BuildClaudeResumeCommand_AfterFlap_ResumesRichID) and 6 for the validator reversal (GlobalEnabledWithChannels_NoWarning, GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin, v1_7_22_CodesRemoved, WrapperStateDir_AntiPattern, ReadTelegramGloballyEnabled_, WarnsOnChannelsWithoutGlobalPlugin). Refs #661 #658 Committed by Ashesh Goplani

… manifest entries" Revert "fix(#661): GREEN — telegram validator reversed to match verified topology" Revert "test(#661): RED — v1.7.22 telegram topology warning direction was inverted" The three preceding commits reversed the v1.7.22 telegram topology validator on a reading of in-conductor reports that turned out to be wrong. Codex-reviewed and empirically verified PM 2026-04-18: - enabledPlugins."telegram@claude-plugins-official" = FALSE globally is the correct topology. With global=true, every Task-subagent and agent-deck child auto-loads the plugin and starts its own bun telegram poller → duplicate pollers race on the same bot token → Telegram rejects one with 409 Conflict → "bot goes silent after a while" symptom. - --channels plugin:telegram@... is what ACTIVATES the plugin for a given session. The plugin must be INSTALLED in the profile (claude plugin list shows it), but must NOT be in enabledPlugins globally. - Verification state at revert time: 3 conductor pollers (agent-deck, innotrade, travel) all stable with global=false, each with a distinct TELEGRAM_STATE_DIR, no 409s. This revert restores the v1.7.22 validator (GLOBAL_ANTIPATTERN, DOUBLE_LOAD, WRAPPER_DEPRECATED codes), its tests, the doc sections in skills/agent-deck/SKILL.md and conductor/conductor-claude.md, and the six v1722-telegram-* manifest entries in .claude/release-tests.yaml. Fix (a) for #661 (size-aware hook rebind guard in UpdateHookStatus) is untouched — it lives in b2104ef and d1ee43d, which are orthogonal to the telegram direction question. The six v1723-history-loss-* manifest entries dropped by this revert will be re-added in a follow-up commit. Refs #661 #658 Committed by Ashesh Goplani

Six new entries pinning the UpdateHookStatus size-aware rebind guard added in b2104ef. These tests are permanent — any future change to Instance.UpdateHookStatus or sessionConversationByteSize must run them. Entries: - v1723-history-loss-rebind-rejects-less-data (the main #661 guard) - v1723-history-loss-rebind-empty-candidate (v1.7.7 fallback preserved) - v1723-history-loss-rebind-allows-current-empty (healthy bring-up) - v1723-history-loss-rebind-allows-cold-start (first bind accepted) - v1723-history-loss-flap-replay (3-cycle travel-conductor replay) - v1723-history-loss-boundary-resume (buildClaudeResumeCommand pin) Separate from the reverted telegram-direction commits so the manifest stays aligned with the code shipped. Refs #661 Committed by Ashesh Goplani

Ships fix for #661 (conductor history-loss guard on the UserPromptSubmit hook rebind path). Size-aware rejection prevents a fresh 1-record jsonl from overwriting a rich historic jsonl when both pass the v1.7.7 has-any-data check. Structural follow-up tracked as #662 (sessionHasConversationData false-negative in buildClaudeResumeCommand) — v1.7.24 scope, not shipped here. Refs #661 Committed by Ashesh Goplani

…856) (#883) * docs: add explainer images for each user-facing concept Five PNG diagrams in documentation/assets/ embedded into the matching guides. Generated via codex CLI's built-in image_gen tool (gpt-image-2), Tokyo Night palette, technical-architecture style. - conductor-overview.png → CONDUCTOR.md hero - channels-topology.png → CONDUCTOR.md (one-bot-per-conductor) - watcher-doorbell.png → WATCHERS.md hero - skills-tiers.png → SKILLS.md hero - watchdog-restart.png → WATCHDOG.md hero Total ~5.6MB; each image roughly 1MB. PNG-only (gpt-image-2 doesn't emit SVG). Recipe to regenerate is in ~/.agent-deck/conductor/CLAUDE.md in case any concept evolves and the diagrams need refreshing. No code changes. * fix(session): rebind on /clear-created new jsonl regardless of size (#856) The v1.7.23 size-guard added in PR #664 (issue #661) used a strict `<=` byte comparison: candidate must have strictly more bytes than current to win a hook rebind. This stops the UserPromptSubmit flap that lets a fresh 1-record jsonl overwrite a 974KB historic jsonl, but also rejects every user-initiated new session — `/clear`, plain `claude`, picker `--resume` — since fresh sessions are smaller by definition. The tile stays bound to the old uuid until the new jsonl outgrows it in bytes (~209KB in the real-install evidence). Fix: add an mtime-gap escape hatch. In a #661 flap the user keeps typing into the rich session, so its mtime stays fresh; in /clear the user abandons the old session, so its mtime stales while the new jsonl's mtime advances. When candidate.mtime - current.mtime >= 5s (clearRebindMtimeGrace), treat the candidate as a user-initiated new session and rebind regardless of size. 5s is well above the ~2s hook poll cadence (flaps touch both files inside that window) but well below the time a /clear + follow-up prompt takes. The new helper sessionConversationMtime mirrors sessionConversationByteSize exactly — same per-instance config-dir lookup, same all-projects fallback, same graceful degradation to zero on missing file. Coverage: - TestInstance_UpdateHookStatus_ClearCreatesNewSession_RebindsRegardlessOfSize reproduces the issue: 200KB old jsonl with 2-min-old mtime, 1-record new jsonl with current mtime → must rebind to new id. - All existing #661 protection tests still pass: flap test seeds both files within microseconds, so mtime gap is well below 5s and the size guard remains the deciding factor. Committed by Ashesh Goplani

asheshgoplani added 8 commits April 18, 2026 15:06

asheshgoplani merged commit ab3dce4 into main Apr 18, 2026
2 of 4 checks passed

asheshgoplani deleted the fix/v1723-history-loss-guard branch April 23, 2026 06:52

ZDreamer2 mentioned this pull request May 4, 2026

Size-guard rejects new session after /clear until it outgrows the old one #856

Closed

asheshgoplani mentioned this pull request May 6, 2026

fix(session): rebind on /clear-created new jsonl regardless of size (#856) #883

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664

fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664
asheshgoplani merged 8 commits intomainfrom
fix/v1723-history-loss-guard

asheshgoplani commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant