fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664
Merged
asheshgoplani merged 8 commits intomainfrom Apr 18, 2026
Merged
fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664asheshgoplani merged 8 commits intomainfrom
asheshgoplani merged 8 commits intomainfrom
Conversation
…history loss Adds 5 new UpdateHookStatus tests pinning the manifestation-3 flap: UserPromptSubmit arrives with a fresh 1-record jsonl ID and silently overwrites the rich hundreds-of-KB historic jsonl ID on every restart. Tests: - RejectsRebindWhenCurrentHasMoreData — pins size-based rejection - RejectsRebindWhenCurrentHasDataCandidateEmpty — preserves v1.7.7 guard - AllowsRebindWhenCurrentIsEmpty — healthy first-prompt rebind - AllowsRebindWhenCurrentIsUnset — cold start - RejectsBidirectionalFlap — replays 3× travel conductor flap from lifecycle log All 5 are RED against v1.7.22. GREEN arrives in the follow-up impl commit. Refs #661 Committed by Ashesh Goplani
…ry loss
Adds a third decision branch to Instance.UpdateHookStatus's claude path:
when BOTH current and candidate session IDs have conversation data, the
candidate must have STRICTLY MORE bytes on disk to win the rebind. This
plugs manifestation 3 of the conductor history-loss flap where:
1. Fresh claude boots → SessionStart with empty jsonl → v1.7.7 rejects ✓
2. User types first prompt → UserPromptSubmit arrives with the
candidate jsonl now holding exactly 1 record
3. v1.7.7's binary has-any-data check passes → rebind executes and
silently overwrites the rich 974KB historic jsonl
Byte size is the robust "how much history this session holds" proxy —
immune to record-count ties and faster than rescanning the file. The
tmux_env rebind path (UpdateClaudeSession, line 2750-2781) already had a
stronger guard; this brings the hook_payload path to parity.
New helper: sessionConversationByteSize(inst, sessionID) int64 — reuses
the per-instance config-dir lookup + findSessionFileInAllProjects
fallback used by sessionHasConversationData.
Refactor: the successful-bind bookkeeping (lifecycle event, field
assignments, tmux env propagation) extracted into
bindClaudeSessionFromHook so cold-start and rebind paths share one
implementation instead of copy-pasting.
Tests: all 5 new TestInstance_UpdateHookStatus_* tests pass with -race,
plus TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID pins
the live boundary (the next restart's --resume argument must be the rich
ID, not the overwritten fresh one).
Refs #661
Committed by Ashesh Goplani
…erted
Production verification showed the v1.7.22 model was wrong. --channels
only SUBSCRIBES a session to a channel; it does not force-load the
plugin. So:
- enabledPlugins.telegram@...=true is REQUIRED, not an anti-pattern —
it's how the plugin actually loads into the claude process.
- Global enablement + --channels does NOT double-load; the plugin
loads once, and --channels wires the subscription for whichever
session owns the channel. Other sessions simply never subscribe.
- The real misconfiguration is the opposite: global DISABLED +
--channels on a conductor. The plugin never loads, --channels is a
no-op, and the conductor has no telegram bridge at all (silent
outage).
New test suite expresses the corrected semantics:
- GlobalEnabledWithChannels_NoWarning (canonical topology)
- GlobalEnabled_NoChannels_NoWarning (non-subscribing sessions fine)
- GlobalDisabled_NoChannels_NoWarning (nothing to warn about)
- GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin (THE new warning)
- GlobalDisabled_WithNonTelegramChannels_NoWarning (prefix-match guard)
- WrapperStateDir_AntiPattern (WRAPPER_DEPRECATED unchanged)
- WrapperStateDir_NoTelegramChannel_NoWarning (unchanged)
- v1_7_22_CodesRemoved (permanence guard for GLOBAL_ANTIPATTERN/DOUBLE_LOAD)
CLI tests flipped to match: TestConductorSetup_WarnsOnChannelsWithoutGlobalPlugin
replaces TestConductorSetup_WarnsOnGlobalTelegramEnabled, and the
"clean/silent" case is now the canonical topology (global=true +
--channels) rather than its opposite.
All 5 new assertions are RED against the v1.7.22 validator. GREEN
arrives in the follow-up impl commit that rewrites the validator logic.
Refs #661 #658
Committed by Ashesh Goplani
…logy Removes GLOBAL_ANTIPATTERN and DOUBLE_LOAD codes. Adds CHANNELS_WITHOUT_GLOBAL_PLUGIN for the real misconfiguration: conductor sessions that subscribe via --channels while the plugin is globally disabled, so the plugin never loads and the subscription silently no-ops. WRAPPER_DEPRECATED is unchanged — env_file is still the canonical injection mechanism. Validator surface: Canonical topology (global=true + --channels) → 0 warnings Global-only, non-subscribing session → 0 warnings Channels without global plugin (silent bridge outage) → CHANNELS_WITHOUT_GLOBAL_PLUGIN Wrapper-based TELEGRAM_STATE_DIR with telegram channel → WRAPPER_DEPRECATED Child-env-leak (#658) remains a real problem and its fix is unchanged — this commit only corrects the user-facing warning signal that was telling people to misconfigure themselves out of having a bridge. Refs #661 #658 Committed by Ashesh Goplani
…t entries Three coordinated doc surfaces updated to match the verified topology: - skills/agent-deck/SKILL.md: the "Telegram conductor topology" section now says enable the plugin globally + subscribe per-conductor via --channels + inject TELEGRAM_STATE_DIR via env_file. The anti-patterns table replaces GLOBAL_ANTIPATTERN / DOUBLE_LOAD with the single new code CHANNELS_WITHOUT_GLOBAL_PLUGIN that warns on the real misconfiguration. Adds a "Topology note" explaining why v1.7.22 was reversed so older readers who saw the old guidance don't re-regress. Removes the stale "Many competing telegram pollers" section whose advice to disable globally was the exact cause of the travel-conductor silent-bridge outage. - conductor/conductor-claude.md: runbook preflight + debug checklist now say enable the plugin globally, look for CHANNELS_WITHOUT_GLOBAL_PLUGIN or WRAPPER_DEPRECATED warnings (not the removed v1.7.22 codes), and calls out the topology-note-why-reversed explicitly so any conductor still reading v1.7.22 runbooks is redirected. - .claude/release-tests.yaml: drops the six v1722-telegram-* manifest entries (they pinned the wrong warning direction) and appends 12 new v1723-* entries — 6 for issue #661 history-loss guard (RejectsRebindWhenCurrentHasMoreData, RejectsRebindWhenCurrentHasDataCandidateEmpty, AllowsRebindWhenCurrentIsEmpty, AllowsRebindWhenCurrentIsUnset, RejectsBidirectionalFlap, BuildClaudeResumeCommand_AfterFlap_ResumesRichID) and 6 for the validator reversal (GlobalEnabledWithChannels_NoWarning, GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin, v1_7_22_CodesRemoved, WrapperStateDir_AntiPattern, ReadTelegramGloballyEnabled_, WarnsOnChannelsWithoutGlobalPlugin). Refs #661 #658 Committed by Ashesh Goplani
… manifest entries" Revert "fix(#661): GREEN — telegram validator reversed to match verified topology" Revert "test(#661): RED — v1.7.22 telegram topology warning direction was inverted" The three preceding commits reversed the v1.7.22 telegram topology validator on a reading of in-conductor reports that turned out to be wrong. Codex-reviewed and empirically verified PM 2026-04-18: - enabledPlugins."telegram@claude-plugins-official" = FALSE globally is the correct topology. With global=true, every Task-subagent and agent-deck child auto-loads the plugin and starts its own bun telegram poller → duplicate pollers race on the same bot token → Telegram rejects one with 409 Conflict → "bot goes silent after a while" symptom. - --channels plugin:telegram@... is what ACTIVATES the plugin for a given session. The plugin must be INSTALLED in the profile (claude plugin list shows it), but must NOT be in enabledPlugins globally. - Verification state at revert time: 3 conductor pollers (agent-deck, innotrade, travel) all stable with global=false, each with a distinct TELEGRAM_STATE_DIR, no 409s. This revert restores the v1.7.22 validator (GLOBAL_ANTIPATTERN, DOUBLE_LOAD, WRAPPER_DEPRECATED codes), its tests, the doc sections in skills/agent-deck/SKILL.md and conductor/conductor-claude.md, and the six v1722-telegram-* manifest entries in .claude/release-tests.yaml. Fix (a) for #661 (size-aware hook rebind guard in UpdateHookStatus) is untouched — it lives in b2104ef and d1ee43d, which are orthogonal to the telegram direction question. The six v1723-history-loss-* manifest entries dropped by this revert will be re-added in a follow-up commit. Refs #661 #658 Committed by Ashesh Goplani
Six new entries pinning the UpdateHookStatus size-aware rebind guard added in b2104ef. These tests are permanent — any future change to Instance.UpdateHookStatus or sessionConversationByteSize must run them. Entries: - v1723-history-loss-rebind-rejects-less-data (the main #661 guard) - v1723-history-loss-rebind-empty-candidate (v1.7.7 fallback preserved) - v1723-history-loss-rebind-allows-current-empty (healthy bring-up) - v1723-history-loss-rebind-allows-cold-start (first bind accepted) - v1723-history-loss-flap-replay (3-cycle travel-conductor replay) - v1723-history-loss-boundary-resume (buildClaudeResumeCommand pin) Separate from the reverted telegram-direction commits so the manifest stays aligned with the code shipped. Refs #661 Committed by Ashesh Goplani
Ships fix for #661 (conductor history-loss guard on the UserPromptSubmit hook rebind path). Size-aware rejection prevents a fresh 1-record jsonl from overwriting a rich historic jsonl when both pass the v1.7.7 has-any-data check. Structural follow-up tracked as #662 (sessionHasConversationData false-negative in buildClaudeResumeCommand) — v1.7.24 scope, not shipped here. Refs #661 Committed by Ashesh Goplani
4 tasks
asheshgoplani
added a commit
that referenced
this pull request
May 6, 2026
…856) (#883) * docs: add explainer images for each user-facing concept Five PNG diagrams in documentation/assets/ embedded into the matching guides. Generated via codex CLI's built-in image_gen tool (gpt-image-2), Tokyo Night palette, technical-architecture style. - conductor-overview.png → CONDUCTOR.md hero - channels-topology.png → CONDUCTOR.md (one-bot-per-conductor) - watcher-doorbell.png → WATCHERS.md hero - skills-tiers.png → SKILLS.md hero - watchdog-restart.png → WATCHDOG.md hero Total ~5.6MB; each image roughly 1MB. PNG-only (gpt-image-2 doesn't emit SVG). Recipe to regenerate is in ~/.agent-deck/conductor/CLAUDE.md in case any concept evolves and the diagrams need refreshing. No code changes. * fix(session): rebind on /clear-created new jsonl regardless of size (#856) The v1.7.23 size-guard added in PR #664 (issue #661) used a strict `<=` byte comparison: candidate must have strictly more bytes than current to win a hook rebind. This stops the UserPromptSubmit flap that lets a fresh 1-record jsonl overwrite a 974KB historic jsonl, but also rejects every user-initiated new session — `/clear`, plain `claude`, picker `--resume` — since fresh sessions are smaller by definition. The tile stays bound to the old uuid until the new jsonl outgrows it in bytes (~209KB in the real-install evidence). Fix: add an mtime-gap escape hatch. In a #661 flap the user keeps typing into the rich session, so its mtime stays fresh; in /clear the user abandons the old session, so its mtime stales while the new jsonl's mtime advances. When candidate.mtime - current.mtime >= 5s (clearRebindMtimeGrace), treat the candidate as a user-initiated new session and rebind regardless of size. 5s is well above the ~2s hook poll cadence (flaps touch both files inside that window) but well below the time a /clear + follow-up prompt takes. The new helper sessionConversationMtime mirrors sessionConversationByteSize exactly — same per-instance config-dir lookup, same all-projects fallback, same graceful degradation to zero on missing file. Coverage: - TestInstance_UpdateHookStatus_ClearCreatesNewSession_RebindsRegardlessOfSize reproduces the issue: 200KB old jsonl with 2-min-old mtime, 1-record new jsonl with current mtime → must rebind to new id. - All existing #661 protection tests still pass: flap test seeds both files within microseconds, so mtime gap is well below 5s and the size guard remains the deciding factor. Committed by Ashesh Goplani
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #661. Adds a size-aware rebind guard to
Instance.UpdateHookStatus's Claude path so a fresh 1-record jsonl cannot overwrite a rich historic jsonl on theUserPromptSubmitflap. This is the missing counterpart to v1.7.7, which fixed thetmux_envrebind path but left thehook_payloadpath with a weaker binary has-any-data check. Filed #662 for the structural follow-up inbuildClaudeResumeCommand(v1.7.24 scope).Root cause (manifestation 3)
Travel conductor repro (3× in 7 hours):
By the time
UserPromptSubmitarrives, the fresh jsonl holds exactly 1 record and passessessionHasConversationData. v1.7.7's binary check let the rebind through, silently replacing the rich 974KB jsonl with the fresh 1-record one. On the next restart,buildClaudeResumeCommandresumed the fresh ID and history evaporated.Fix
New helper
sessionConversationByteSize(inst, sessionID) int64returns jsonl file size in bytes.UpdateHookStatusnow decides:i.ClaudeSessionID == "") → accept (first bind).sessionIdrecords → rejectcandidate_has_no_conversation_data(v1.7.7, preserved).candidate_has_less_conversation_data.Byte size is a robust "how much history this session holds" proxy — immune to record-count ties and faster than rescanning.
The
tmux_envrebind path (UpdateClaudeSession, line 2750-2781) already had a stronger guard; this brings thehook_payloadpath to parity. The successful-bind bookkeeping is extracted intobindClaudeSessionFromHookso cold-start and rebind paths share one implementation.Tests
Six new tests in
internal/session/instance_test.go:TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasMoreData— the Conductor history loss — UserPromptSubmit rebind path lacks 'current_has_more_data' guard (manifestation 3) #661 pin.TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasDataCandidateEmpty— v1.7.7 path preserved.TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsEmpty— healthy first-prompt rebind.TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsUnset— cold start.TestInstance_UpdateHookStatus_RejectsBidirectionalFlap— 3-cycle travel-conductor replay.TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID— live-runtime boundary: the emitted restart command must contain--resume <rich-uuid>.Six matching regression entries appended to
.claude/release-tests.yaml.Telegram-direction non-event (three reverted commits in history)
Three commits (
43953e9,b1b53f5,24bb286) reversed the v1.7.22 telegram topology validator based on a wrong interim reading. Codex-reviewed and empirically verified in the same window: v1.7.22's original direction (enabledPlugins.telegram@...=falseglobally + per-conductor--channels) is correct.53526bfreverts those three; the v1.7.22 validator, tests, and docs are unchanged from main. The revert is preserved in history as an honest record.Test plan
go test ./... -race -count=1 -timeout 900s— green across 30+ packagesTestInstance_UpdateHookStatus_RejectsBidirectionalFlap+TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichIDwith-race— all greenTestPersistence_*(mandatory session-lifecycle gate) — greenCommitted by Ashesh Goplani