Skip to content

fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664

Merged
asheshgoplani merged 8 commits intomainfrom
fix/v1723-history-loss-guard
Apr 18, 2026
Merged

fix(#661): v1.7.23 — size-aware hook rebind guard stops conductor history loss#664
asheshgoplani merged 8 commits intomainfrom
fix/v1723-history-loss-guard

Conversation

@asheshgoplani
Copy link
Copy Markdown
Owner

Summary

Closes #661. Adds a size-aware rebind guard to Instance.UpdateHookStatus's Claude path so a fresh 1-record jsonl cannot overwrite a rich historic jsonl on the UserPromptSubmit flap. This is the missing counterpart to v1.7.7, which fixed the tmux_env rebind path but left the hook_payload path with a weaker binary has-any-data check. Filed #662 for the structural follow-up in buildClaudeResumeCommand (v1.7.24 scope).

Root cause (manifestation 3)

Travel conductor repro (3× in 7 hours):

bind    hook_payload  new=dd17cb25   SessionStart           (original, 974KB jsonl)
rebind  hook_payload  50fe72cc→dd17cb25  SessionEnd
reject  hook_payload  dd17cb25  reason=candidate_has_no_conversation_data
rebind  hook_payload  dd17cb25→50fe72cc  UserPromptSubmit   ← THE BUG

By the time UserPromptSubmit arrives, the fresh jsonl holds exactly 1 record and passes sessionHasConversationData. v1.7.7's binary check let the rebind through, silently replacing the rich 974KB jsonl with the fresh 1-record one. On the next restart, buildClaudeResumeCommand resumed the fresh ID and history evaporated.

Fix

New helper sessionConversationByteSize(inst, sessionID) int64 returns jsonl file size in bytes. UpdateHookStatus now decides:

  1. Cold start (i.ClaudeSessionID == "") → accept (first bind).
  2. Candidate has zero sessionId records → reject candidate_has_no_conversation_data (v1.7.7, preserved).
  3. Both sides have data → candidate must be strictly larger in bytes. Otherwise reject candidate_has_less_conversation_data.

Byte size is a robust "how much history this session holds" proxy — immune to record-count ties and faster than rescanning.

The tmux_env rebind path (UpdateClaudeSession, line 2750-2781) already had a stronger guard; this brings the hook_payload path to parity. The successful-bind bookkeeping is extracted into bindClaudeSessionFromHook so cold-start and rebind paths share one implementation.

Tests

Six new tests in internal/session/instance_test.go:

  • TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasMoreData — the Conductor history loss — UserPromptSubmit rebind path lacks 'current_has_more_data' guard (manifestation 3) #661 pin.
  • TestInstance_UpdateHookStatus_RejectsRebindWhenCurrentHasDataCandidateEmpty — v1.7.7 path preserved.
  • TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsEmpty — healthy first-prompt rebind.
  • TestInstance_UpdateHookStatus_AllowsRebindWhenCurrentIsUnset — cold start.
  • TestInstance_UpdateHookStatus_RejectsBidirectionalFlap — 3-cycle travel-conductor replay.
  • TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID — live-runtime boundary: the emitted restart command must contain --resume <rich-uuid>.

Six matching regression entries appended to .claude/release-tests.yaml.

Telegram-direction non-event (three reverted commits in history)

Three commits (43953e9, b1b53f5, 24bb286) reversed the v1.7.22 telegram topology validator based on a wrong interim reading. Codex-reviewed and empirically verified in the same window: v1.7.22's original direction (enabledPlugins.telegram@...=false globally + per-conductor --channels) is correct. 53526bf reverts those three; the v1.7.22 validator, tests, and docs are unchanged from main. The revert is preserved in history as an honest record.

Test plan

  • go test ./... -race -count=1 -timeout 900s — green across 30+ packages
  • 5× consecutive TestInstance_UpdateHookStatus_RejectsBidirectionalFlap + TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID with -race — all green
  • TestPersistence_* (mandatory session-lifecycle gate) — green
  • Telegram validator suite — green (v1.7.22 state preserved)
  • GH Actions on this PR — pending

Committed by Ashesh Goplani

…history loss

Adds 5 new UpdateHookStatus tests pinning the manifestation-3 flap:
UserPromptSubmit arrives with a fresh 1-record jsonl ID and silently
overwrites the rich hundreds-of-KB historic jsonl ID on every restart.

Tests:
- RejectsRebindWhenCurrentHasMoreData — pins size-based rejection
- RejectsRebindWhenCurrentHasDataCandidateEmpty — preserves v1.7.7 guard
- AllowsRebindWhenCurrentIsEmpty — healthy first-prompt rebind
- AllowsRebindWhenCurrentIsUnset — cold start
- RejectsBidirectionalFlap — replays 3× travel conductor flap from lifecycle log

All 5 are RED against v1.7.22. GREEN arrives in the follow-up impl commit.

Refs #661
Committed by Ashesh Goplani
…ry loss

Adds a third decision branch to Instance.UpdateHookStatus's claude path:
when BOTH current and candidate session IDs have conversation data, the
candidate must have STRICTLY MORE bytes on disk to win the rebind. This
plugs manifestation 3 of the conductor history-loss flap where:

  1. Fresh claude boots → SessionStart with empty jsonl → v1.7.7 rejects ✓
  2. User types first prompt → UserPromptSubmit arrives with the
     candidate jsonl now holding exactly 1 record
  3. v1.7.7's binary has-any-data check passes → rebind executes and
     silently overwrites the rich 974KB historic jsonl

Byte size is the robust "how much history this session holds" proxy —
immune to record-count ties and faster than rescanning the file. The
tmux_env rebind path (UpdateClaudeSession, line 2750-2781) already had a
stronger guard; this brings the hook_payload path to parity.

New helper: sessionConversationByteSize(inst, sessionID) int64 — reuses
the per-instance config-dir lookup + findSessionFileInAllProjects
fallback used by sessionHasConversationData.

Refactor: the successful-bind bookkeeping (lifecycle event, field
assignments, tmux env propagation) extracted into
bindClaudeSessionFromHook so cold-start and rebind paths share one
implementation instead of copy-pasting.

Tests: all 5 new TestInstance_UpdateHookStatus_* tests pass with -race,
plus TestInstance_BuildClaudeResumeCommand_AfterFlap_ResumesRichID pins
the live boundary (the next restart's --resume argument must be the rich
ID, not the overwritten fresh one).

Refs #661
Committed by Ashesh Goplani
…erted

Production verification showed the v1.7.22 model was wrong. --channels
only SUBSCRIBES a session to a channel; it does not force-load the
plugin. So:

  - enabledPlugins.telegram@...=true is REQUIRED, not an anti-pattern —
    it's how the plugin actually loads into the claude process.
  - Global enablement + --channels does NOT double-load; the plugin
    loads once, and --channels wires the subscription for whichever
    session owns the channel. Other sessions simply never subscribe.
  - The real misconfiguration is the opposite: global DISABLED +
    --channels on a conductor. The plugin never loads, --channels is a
    no-op, and the conductor has no telegram bridge at all (silent
    outage).

New test suite expresses the corrected semantics:

  - GlobalEnabledWithChannels_NoWarning (canonical topology)
  - GlobalEnabled_NoChannels_NoWarning (non-subscribing sessions fine)
  - GlobalDisabled_NoChannels_NoWarning (nothing to warn about)
  - GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin (THE new warning)
  - GlobalDisabled_WithNonTelegramChannels_NoWarning (prefix-match guard)
  - WrapperStateDir_AntiPattern (WRAPPER_DEPRECATED unchanged)
  - WrapperStateDir_NoTelegramChannel_NoWarning (unchanged)
  - v1_7_22_CodesRemoved (permanence guard for GLOBAL_ANTIPATTERN/DOUBLE_LOAD)

CLI tests flipped to match: TestConductorSetup_WarnsOnChannelsWithoutGlobalPlugin
replaces TestConductorSetup_WarnsOnGlobalTelegramEnabled, and the
"clean/silent" case is now the canonical topology (global=true +
--channels) rather than its opposite.

All 5 new assertions are RED against the v1.7.22 validator. GREEN
arrives in the follow-up impl commit that rewrites the validator logic.

Refs #661 #658
Committed by Ashesh Goplani
…logy

Removes GLOBAL_ANTIPATTERN and DOUBLE_LOAD codes. Adds
CHANNELS_WITHOUT_GLOBAL_PLUGIN for the real misconfiguration: conductor
sessions that subscribe via --channels while the plugin is globally
disabled, so the plugin never loads and the subscription silently
no-ops. WRAPPER_DEPRECATED is unchanged — env_file is still the
canonical injection mechanism.

Validator surface:

  Canonical topology (global=true + --channels)         → 0 warnings
  Global-only, non-subscribing session                  → 0 warnings
  Channels without global plugin (silent bridge outage) → CHANNELS_WITHOUT_GLOBAL_PLUGIN
  Wrapper-based TELEGRAM_STATE_DIR with telegram channel → WRAPPER_DEPRECATED

Child-env-leak (#658) remains a real problem and its fix is unchanged —
this commit only corrects the user-facing warning signal that was
telling people to misconfigure themselves out of having a bridge.

Refs #661 #658
Committed by Ashesh Goplani
…t entries

Three coordinated doc surfaces updated to match the verified topology:

- skills/agent-deck/SKILL.md: the "Telegram conductor topology" section
  now says enable the plugin globally + subscribe per-conductor via
  --channels + inject TELEGRAM_STATE_DIR via env_file. The anti-patterns
  table replaces GLOBAL_ANTIPATTERN / DOUBLE_LOAD with the single new
  code CHANNELS_WITHOUT_GLOBAL_PLUGIN that warns on the real
  misconfiguration. Adds a "Topology note" explaining why v1.7.22 was
  reversed so older readers who saw the old guidance don't re-regress.
  Removes the stale "Many competing telegram pollers" section whose
  advice to disable globally was the exact cause of the travel-conductor
  silent-bridge outage.

- conductor/conductor-claude.md: runbook preflight + debug checklist now
  say enable the plugin globally, look for CHANNELS_WITHOUT_GLOBAL_PLUGIN
  or WRAPPER_DEPRECATED warnings (not the removed v1.7.22 codes), and
  calls out the topology-note-why-reversed explicitly so any conductor
  still reading v1.7.22 runbooks is redirected.

- .claude/release-tests.yaml: drops the six v1722-telegram-* manifest
  entries (they pinned the wrong warning direction) and appends 12 new
  v1723-* entries — 6 for issue #661 history-loss guard
  (RejectsRebindWhenCurrentHasMoreData, RejectsRebindWhenCurrentHasDataCandidateEmpty,
  AllowsRebindWhenCurrentIsEmpty, AllowsRebindWhenCurrentIsUnset,
  RejectsBidirectionalFlap, BuildClaudeResumeCommand_AfterFlap_ResumesRichID)
  and 6 for the validator reversal
  (GlobalEnabledWithChannels_NoWarning,
  GlobalDisabled_WithChannels_WarnChannelsWithoutPlugin,
  v1_7_22_CodesRemoved, WrapperStateDir_AntiPattern,
  ReadTelegramGloballyEnabled_, WarnsOnChannelsWithoutGlobalPlugin).

Refs #661 #658
Committed by Ashesh Goplani
… manifest entries"

Revert "fix(#661): GREEN — telegram validator reversed to match verified topology"
Revert "test(#661): RED — v1.7.22 telegram topology warning direction was inverted"

The three preceding commits reversed the v1.7.22 telegram topology
validator on a reading of in-conductor reports that turned out to be
wrong. Codex-reviewed and empirically verified PM 2026-04-18:

- enabledPlugins."telegram@claude-plugins-official" = FALSE globally is
  the correct topology. With global=true, every Task-subagent and
  agent-deck child auto-loads the plugin and starts its own bun
  telegram poller → duplicate pollers race on the same bot token →
  Telegram rejects one with 409 Conflict → "bot goes silent after a
  while" symptom.
- --channels plugin:telegram@... is what ACTIVATES the plugin for a
  given session. The plugin must be INSTALLED in the profile
  (claude plugin list shows it), but must NOT be in enabledPlugins
  globally.
- Verification state at revert time: 3 conductor pollers (agent-deck,
  innotrade, travel) all stable with global=false, each with a
  distinct TELEGRAM_STATE_DIR, no 409s.

This revert restores the v1.7.22 validator (GLOBAL_ANTIPATTERN,
DOUBLE_LOAD, WRAPPER_DEPRECATED codes), its tests, the doc sections in
skills/agent-deck/SKILL.md and conductor/conductor-claude.md, and the
six v1722-telegram-* manifest entries in .claude/release-tests.yaml.

Fix (a) for #661 (size-aware hook rebind guard in UpdateHookStatus) is
untouched — it lives in b2104ef and d1ee43d, which are orthogonal to
the telegram direction question. The six v1723-history-loss-* manifest
entries dropped by this revert will be re-added in a follow-up commit.

Refs #661 #658
Committed by Ashesh Goplani
Six new entries pinning the UpdateHookStatus size-aware rebind guard
added in b2104ef. These tests are permanent — any future change to
Instance.UpdateHookStatus or sessionConversationByteSize must run them.

Entries:
- v1723-history-loss-rebind-rejects-less-data (the main #661 guard)
- v1723-history-loss-rebind-empty-candidate (v1.7.7 fallback preserved)
- v1723-history-loss-rebind-allows-current-empty (healthy bring-up)
- v1723-history-loss-rebind-allows-cold-start (first bind accepted)
- v1723-history-loss-flap-replay (3-cycle travel-conductor replay)
- v1723-history-loss-boundary-resume (buildClaudeResumeCommand pin)

Separate from the reverted telegram-direction commits so the manifest
stays aligned with the code shipped.

Refs #661
Committed by Ashesh Goplani
Ships fix for #661 (conductor history-loss guard on the UserPromptSubmit
hook rebind path). Size-aware rejection prevents a fresh 1-record jsonl
from overwriting a rich historic jsonl when both pass the v1.7.7
has-any-data check.

Structural follow-up tracked as #662 (sessionHasConversationData
false-negative in buildClaudeResumeCommand) — v1.7.24 scope, not
shipped here.

Refs #661
Committed by Ashesh Goplani
@asheshgoplani asheshgoplani merged commit ab3dce4 into main Apr 18, 2026
2 of 4 checks passed
@asheshgoplani asheshgoplani deleted the fix/v1723-history-loss-guard branch April 23, 2026 06:52
asheshgoplani added a commit that referenced this pull request May 6, 2026
…856) (#883)

* docs: add explainer images for each user-facing concept

Five PNG diagrams in documentation/assets/ embedded into the matching
guides. Generated via codex CLI's built-in image_gen tool (gpt-image-2),
Tokyo Night palette, technical-architecture style.

- conductor-overview.png  → CONDUCTOR.md hero
- channels-topology.png   → CONDUCTOR.md (one-bot-per-conductor)
- watcher-doorbell.png    → WATCHERS.md hero
- skills-tiers.png        → SKILLS.md hero
- watchdog-restart.png    → WATCHDOG.md hero

Total ~5.6MB; each image roughly 1MB. PNG-only (gpt-image-2 doesn't
emit SVG). Recipe to regenerate is in ~/.agent-deck/conductor/CLAUDE.md
in case any concept evolves and the diagrams need refreshing.

No code changes.

* fix(session): rebind on /clear-created new jsonl regardless of size (#856)

The v1.7.23 size-guard added in PR #664 (issue #661) used a strict `<=`
byte comparison: candidate must have strictly more bytes than current to
win a hook rebind. This stops the UserPromptSubmit flap that lets a fresh
1-record jsonl overwrite a 974KB historic jsonl, but also rejects every
user-initiated new session — `/clear`, plain `claude`, picker `--resume`
— since fresh sessions are smaller by definition. The tile stays bound
to the old uuid until the new jsonl outgrows it in bytes (~209KB in the
real-install evidence).

Fix: add an mtime-gap escape hatch. In a #661 flap the user keeps typing
into the rich session, so its mtime stays fresh; in /clear the user
abandons the old session, so its mtime stales while the new jsonl's
mtime advances. When candidate.mtime - current.mtime >= 5s
(clearRebindMtimeGrace), treat the candidate as a user-initiated new
session and rebind regardless of size. 5s is well above the ~2s hook
poll cadence (flaps touch both files inside that window) but well below
the time a /clear + follow-up prompt takes.

The new helper sessionConversationMtime mirrors sessionConversationByteSize
exactly — same per-instance config-dir lookup, same all-projects fallback,
same graceful degradation to zero on missing file.

Coverage:
- TestInstance_UpdateHookStatus_ClearCreatesNewSession_RebindsRegardlessOfSize
  reproduces the issue: 200KB old jsonl with 2-min-old mtime, 1-record
  new jsonl with current mtime → must rebind to new id.
- All existing #661 protection tests still pass: flap test seeds both
  files within microseconds, so mtime gap is well below 5s and the size
  guard remains the deciding factor.

Committed by Ashesh Goplani
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Conductor history loss — UserPromptSubmit rebind path lacks 'current_has_more_data' guard (manifestation 3)

1 participant