Skip to content

feat: pre-assembly cache-TTL compaction#2

Closed
liu51115 wants to merge 57 commits intoelectricsheephq:mainfrom
liu51115:feat/cache-ttl-compaction
Closed

feat: pre-assembly cache-TTL compaction#2
liu51115 wants to merge 57 commits intoelectricsheephq:mainfrom
liu51115:feat/cache-ttl-compaction

Conversation

@liu51115
Copy link
Copy Markdown

@liu51115 liu51115 commented Apr 10, 2026

Summary

Replace afterTurn cache-state-based compaction with assembly-path TTL-based trigger. Based on v0.8.0.

Problem

evaluateIncrementalCompaction() runs in afterTurn() and reads the cache status of the call that just completed. This is a timing inversion:

Solution

1. Pre-assembly compaction (new)

Before assembling context, check: idle > cacheTTLSeconds (default 300s) AND memory pressure? → compact before assembly.

2. Simplify afterTurn

Remove hot-cache-budget-headroom, hot-cache-defer, cold-cache-catchup. Keep budget-trigger safety valve and simple leaf-trigger.

Changes

  • config.ts: Add cacheTTLSeconds (default 300, env LCM_CACHE_TTL_SECONDS)
  • migration.ts: Add last_api_call_at column
  • compaction-telemetry-store.ts: Read/write lastApiCallAt
  • engine.ts: Pre-assembly check in assemble(), simplified evaluateIncrementalCompaction()
  • config.test.ts: Updated assertions

Closes Martian-Engineering#367. Related: Martian-Engineering#358, Martian-Engineering#362, Martian-Engineering#363.

Summary by CodeRabbit

  • New Features

    • Added /lossless command (alias /lcm) for health checks, diagnostics, and conversation cleanup.
    • Introduced /lcm doctor clean apply for automated garbage collection of archived sessions.
    • Extended search with sort options: recency, relevance, and hybrid ranking.
    • Multi-conversation synthesis in lcm_expand_query with per-conversation diagnostics.
  • Documentation

    • Added bundled skill with operational guides for configuration, architecture, and diagnostics.
    • Expanded README with command usage and configuration examples.
  • Configuration

    • Added new configuration options for compaction behavior, cache tuning, and timeout settings.

tmchow and others added 30 commits April 3, 2026 14:49
…t content (Martian-Engineering#235)

* fix: preserve text block structure when externalizing large toolResult content

When a toolResult message contains a plain-text content block
({type: "text", text: "..."}) that exceeds the externalization
threshold, interceptLargeToolResults now keeps {type: "text", text: ref}
instead of rewriting to {type: "tool_result", output: ref}.

This prevents the amazon-bedrock provider from crashing on
sanitizeSurrogates(c.text) when c.text is undefined.

The assembler path also reads rawType from stored metadata so
reassembled blocks reconstruct the correct part type.

Fixes Martian-Engineering#196

* fix: restore text blocks for externalized tool results

Make the assembler reconstruct externalized plain-text tool results as
`{ type: "text", text: ... }` instead of forcing them back through the
`tool_result`/`output` shape. Tighten the regression tests so they assert
the exact assembled block shape, and add assembler coverage for the
externalized-text path.

Regeneration-Prompt: |
  Review feedback on PR 235 showed the previous change only altered how
  large plain-text tool results were stored, not how they were assembled
  back into runtime messages. The bug report was that Bedrock reads
  `c.text` for plain text tool-result content, and the PR still rebuilt
  those externalized blocks as `tool_result` objects with `output`, so the
  provider would still see `undefined`.

  Fix the round-trip at the assembler layer with the smallest additive
  change. Preserve existing behavior for structured tool results and
  function_call_output blocks. Add regression tests that fail unless the
  assembled block is actually `type: "text"` with a `text` field, and add
  focused assembler coverage for the externalized plain-text case.

---------

Co-authored-by: Josh Lehman <josh@martian.engineering>
…-Engineering#248)

When tool-use-only assistant turns are stored with content='' and zero
message_parts, or when filterNonFreshAssistantToolCalls strips all
tool_use blocks from a non-fresh assistant message, the resulting
content array is empty ([]) or the content string is falsy.

Anthropic (and other providers) reject messages with empty content:
  'The content field in the Message object at messages.0 is empty'

Add an explicit filter in assemble() to remove these empty assistant
messages before passing to sanitizeToolUseResultPairing and the API.
The filter only targets assistant messages — user messages with empty
content are left untouched (provider may handle differently).

Closes Martian-Engineering#238

Co-authored-by: wujiaming88 <wujiaming88@example.com>
Martian-Engineering#258)

* fix: harden bootstrap budget against oversized messages and NaN config

Two bugs in the bootstrap budget cap introduced in Martian-Engineering#255:

1. A single oversized tail message bypasses the budget entirely.
   The trim loop condition 'if (kept.length > 0 && ...)' means the
   first message (newest) is always kept regardless of size. A 50K-token
   tool result as the last message will bypass a 6K budget. Fix: after
   the loop, check if the single kept message exceeds budget and return
   empty instead of silently bypassing.

2. NaN propagates through all numeric env config parsing.
   parseInt('oops', 10) returns NaN, which is not nullish, so
   ?? fallback never fires. Invalid env like LCM_LEAF_CHUNK_TOKENS=oops
   propagates NaN through leafChunkTokens, bootstrapMaxTokens, and every
   derived config value — effectively disabling all token budgets.

   Fix: add parseFiniteInt/parseFiniteNumber helpers that return undefined
   for non-finite results. Replace all 16 raw parseInt/parseFloat calls
   in resolveLcmConfig() with the safe helpers.

Both bugs were found and reproduced with minimal scripts during
adversarial review of a production incident.

* test: cover bootstrap and env fallback regressions

Add focused regression tests for the oversized singleton bootstrap tail case and invalid numeric env parsing fallback behavior. Add a patch changeset because this PR changes runtime behavior and should be reflected in release notes.

Regeneration-Prompt: |
  The open PR fixed two production regressions but still lacked the release and test follow-through needed to merge. Add targeted regression coverage instead of broad refactors: one config test that proves invalid numeric env values like LCM_LEAF_CHUNK_TOKENS=oops fall back through plugin/default resolution, and one bootstrap test that proves a single oversized tail message is dropped instead of bypassing bootstrapMaxTokens. Also add a patch changeset because the PR changes runtime behavior visible to users and maintainers expect release notes coverage for that.

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…artian-Engineering#222)

* Initial plan

* fix: block concurrent expand-query delegation per origin session

Agent-Logs-Url: https://github.com/Martian-Engineering/lossless-claw/sessions/46499c08-a52b-4640-9235-d4505936b758

Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>

* test: simplify concurrent expand-query gate fixture

Agent-Logs-Url: https://github.com/Martian-Engineering/lossless-claw/sessions/46499c08-a52b-4640-9235-d4505936b758

Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>

* docs: add changeset for expand-query concurrency fix

Agent-Logs-Url: https://github.com/Martian-Engineering/lossless-claw/sessions/46499c08-a52b-4640-9235-d4505936b758

Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>

* fix: narrow expand-query concurrency gating

Delay origin-session concurrency slot acquisition until lcm_expand_query has
resolved scope and found summary IDs to delegate. This preserves the
concurrency block for real delegated sub-agent work without blocking
overlapping no-op or no-match requests that never touch the shared lane.

Add a regression test covering concurrent query calls that return no matches
so harmless probes remain unblocked.

Regeneration-Prompt: |
  Address the PR review finding that the new lcm_expand_query concurrency slot
  was acquired too early. Preserve the intended deadlock prevention for real
  delegated sub-agent runs, but do not serialize requests that exit before any
  delegation happens, such as missing-scope or no-match query paths. Keep the
  existing concurrency-block behavior for actual delegated expansions and add a
  regression test proving concurrent no-match requests both complete normally
  without any gateway agent calls.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…artian-Engineering#180)

* feat: prompt-aware context assembly with BM25-lite relevance scoring

When the token budget is exceeded during context assembly, evictable items
are now scored by relevance to the current user prompt (BM25-lite TF keyword
scoring) rather than dropped in strict chronological order. This means
summaries matching the user's active query are preserved over irrelevant
but more recent content.

- Add `prompt?: string` to AssembleContextInput and LcmContextEngine.assemble()
- Add `text: string` to ResolvedItem for pre-extracted scoring content
- Implement scoreRelevance() using TF-based keyword overlap (no deps, no LLM)
- Fall back to existing chronological eviction when prompt is absent or empty
- Add 6 integration tests covering prompt-aware eviction, fallback, and edge cases

Refs OpenClaw PR #50848. Zero cost increase, fully backwards compatible.

* chore: gitignore CE plan artifacts and TASK.md

* test: add unit tests for BM25-lite scoreRelevance and tokenizeText

Export scoreRelevance and tokenizeText (with @internal JSDoc) for direct
unit testing. Add 13 new tests covering edge cases: empty inputs, no
overlap, case insensitivity, prompt term deduplication, single-char
filtering, and relative scoring. Fix inaccurate docstring that claimed
[0,1] bounded range.

* fix: fall back on unsearchable assembly prompts

Treat prompt-aware assembly as opt-in only when the prompt contains at least one searchable term. Blank or whitespace-only prompts now follow the existing chronological eviction path, and the integration suite covers that regression. Add a patch changeset because this fixes user-visible assembly behavior in the plugin.

Regeneration-Prompt: |
  Review found that prompt-aware context eviction switched behavior on any non-empty prompt string, even when the string had no searchable terms after tokenization. Preserve the new relevance feature, but make blank, whitespace-only, or otherwise unsearchable prompts fall back to the existing chronological eviction path so behavior matches the docs and tests. Keep the change minimal in the assembler, add an integration test that proves whitespace-only prompts keep the chronological result, update public comments to reflect the actual contract, and add a patch changeset because this affects user-visible context assembly behavior.

---------

Co-authored-by: Josh Lehman <josh@martian.engineering>
…an-Engineering#257)

* fix: harden afterTurn dedup guard against false-positive drops

Improves the replay dedup introduced in Martian-Engineering#246 with two fixes:

1. Replace hasMessage() fast-path with aligned-tail boundary check.
   The old approach checks if batch[0] exists *anywhere* in the DB,
   which false-positives on legitimate repeated first messages (e.g.
   user sends 'hello' again). The new check verifies the DB's last
   message aligns with the exact replay boundary position in the
   incoming batch.

2. Run dedup on newMessages before prepending autoCompactionSummary.
   The merged Martian-Engineering#246 deduplicates the full ingestBatch including the
   synthetic summary, which can interfere with replay detection when
   the summary content matches historical messages.

Both changes are conservative: any mismatch falls through to the
existing full ordered-prefix proof, and mismatches always preserve
the batch unchanged (no data loss on false negatives).

* fix: repair afterTurn dedup ingest batch

Fix the follow-up replay dedup change so afterTurn passes the constructed ingest batch into ingestBatch instead of referencing a removed variable. Add a regression test covering restart replay when auto-compaction summary text is prepended, and include a patch changeset for release notes.

Regeneration-Prompt: |
  Review PR 257 in lossless-claw and fix the blocking typo left in the
  afterTurn replay-dedup follow-up. Preserve the aligned-tail replay
  detection approach, keep the fix additive, and avoid changing unrelated
  behavior. Add targeted regression coverage for the summary-prepend edge
  case that the PR description calls out, then add a patch changeset so the
  data-loss hardening lands in release notes. Validate with the repo's
  existing vitest binary from the main checkout because the PR worktree does
  not have its own node_modules.

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…neering#229)

* fix: parse SQLite UTC timestamps with explicit Z suffix

SQLite datetime('now') stores UTC timestamps without a Z suffix.
JavaScript's Date constructor parses bare datetime strings as local time
per ECMA-262, causing timestamps to shift by the local timezone offset.

This adds a parseUtcTimestamp() helper that appends 'Z' before parsing,
and applies it to all new Date(row.*) calls in conversation-store,
summary-store, and migration.

Fixes Martian-Engineering#216

* fix: preserve explicit timestamp offsets

Keep explicit timezone offsets intact in the shared timestamp parser while still normalizing bare SQLite datetime('now') values to UTC. Add focused parser coverage for bare, Z-suffixed, and offset-bearing timestamps, and include a patch changeset for the behavior fix.

Regeneration-Prompt: |
  Address the PR review finding on the shared SQLite timestamp parser introduced for issue Martian-Engineering#216. Preserve the intended fix for bare datetime('now') strings that lack a timezone suffix, but do not break timestamps that already include Z or an explicit offset like +02:00. Add narrow tests that prove all three cases still parse correctly, and include a patch changeset because this affects user-visible timestamp handling.

---------

Co-authored-by: Nemo (docs-sync) <nemo@caeli.ai>
Co-authored-by: Josh Lehman <josh@martian.engineering>
* docs: add Chinese README (README_zh.md)

* docs: 更新相關倉庫連結(新命名)

* feat: CJK trigram FTS search with OR semantics

FTS5 unicode61 tokenizer cannot segment CJK ideographs (Chinese, Japanese,
Korean), so CJK queries fall back to a LIKE path with AND logic. When the
user's phrasing doesn't exactly match the summary text (e.g. querying
"端到端测试结果" when the summary contains "端到端测试"), ALL terms
must match and the query returns zero candidates.

This commit adds:

1. A new FTS5 trigram-tokenized virtual table (summaries_fts_cjk) that
   indexes every 3-character substring, enabling native CJK substring
   matching.

2. searchCjkTrigram() — splits CJK segments into overlapping 4-char
   chunks and combines them with OR semantics via FTS5 MATCH. Non-CJK
   tokens (English, version numbers) are searched in the existing porter
   FTS table. Results are unioned and sorted by recency.

3. searchLikeCjk() — a fallback when the trigram table is unavailable.
   Splits CJK text into bigrams (2-char sliding window) and uses LIKE
   with OR instead of AND, so partial matches return results.

4. Auto-migration: creates summaries_fts_cjk and backfills from existing
   summaries on first run. New summaries are indexed on save.

Tested on 4 machines with Chinese query workloads:
- Before: "端到端测试结果" → 0 candidates
- After:  "端到端测试结果" → correct matches via trigram OR

Fixes CJK zero-result bug affecting all Chinese/Japanese/Korean users.
Related: Martian-Engineering#208 (search path for lcm_expand_query candidate resolution)

* fix: tighten CJK summary search semantics

Keep mixed CJK and Latin summary queries on full-intent matching while
preserving the new CJK-specific recall improvements. Route short CJK
segments through the LIKE fallback so one- and two-character queries do
not regress, and update fallback coverage plus a release note.

Regeneration-Prompt: |
  Address review feedback on the PR that added trigram-backed CJK summary
  search. Preserve the additive migration and the improved recall for CJK
  phrasing differences, but fix the cases where mixed-language queries were
  broadened from implicit AND to OR and where very short CJK queries could
  return no results. Keep the work localized to summary search behavior,
  add regression tests for mixed CJK plus Latin queries and single-character
  CJK queries, and include a changeset because this is user-facing search
  behavior.

---------

Co-authored-by: scott <scott@Scott4.local>
Co-authored-by: Scott Lin <catgodtw@users.noreply.github.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…ian-Engineering#148)

* lossless-claw-3ea: add transcript GC maintenance for externalized tool results

Add a summarized-tool candidate query in SummaryStore and implement LcmContextEngine.maintain() for the conservative first transcript-GC pass. This pass only rewrites tool-result transcript entries that were already externalized into large_files during ingest, are linked through summary_messages, and are no longer present as raw context items. Rebuild replacement toolResult messages from stored message_parts, align them to transcript entries by stable toolCallId, and request runtime-owned rewrites in small batches. Also export the minimal assembler helpers needed for replacement reconstruction and add focused engine tests for candidate selection and maintain()-driven rewrite requests.

Regeneration-Prompt: |
  Implement Phase 2 of the tool-result externalization spec now that upstream OpenClaw has merged the transcript maintenance hook and rewrite helper. Keep this first pass conservative and additive: do not redesign compaction or add new schema unless required. Select transcript-GC candidates from LCM state only when a tool-result message was already externalized into large_files, is covered by summaries, and is no longer present as a raw context item. Rebuild the compact replacement message from stored message_parts so the placeholder content stays canonical, then align candidates to active transcript entries by stable toolCallId and ask the runtime to rewrite them in bounded batches. Skip anything ambiguous instead of trying to be clever. Add focused tests that prove candidate discovery works and that maintain() requests the expected rewrite payload for a summarized externalized tool result.

* docs: add transcript GC spec and changeset

Document the current state of tool-result externalization,
incremental bootstrap, and transcript GC in the repo spec.
Add a changeset for the new runtime-assisted transcript GC
behavior so release notes capture the user-visible impact.

Regeneration-Prompt: |
  OpenClaw upstream landed the transcript rewrite maintenance API,
  and this branch already implements the first pass of transcript GC
  for summarized externalized tool results. Add the missing repo-side
  documentation so the PR is self-contained: a spec in specs/ that
  explains what is already implemented, why it matters operationally,
  and what still remains to finish the design. Also add a changeset,
  because this changes user-visible runtime behavior by shrinking
  active transcripts after safe condensation. Do not pretend the
  implementation is complete; call out the remaining work explicitly,
  including legacy inline tool results, stronger transcript alignment,
  tighter eligibility/fresh-tail rules, and end-to-end integration
  coverage.
…g#243)

* feat: add bundled lossless-claw skill and /lcm diagnostics

Add the approved MVP operator surface for lossless-claw. This ships a bundled lossless-claw skill with focused references, registers a native /lcm command with /lossless as the alias, and exposes scan-only summary health diagnostics through /lcm doctor. It also updates package metadata so the skill is bundled and adds a changeset for the new user-facing surface.

Regeneration-Prompt: |\n  Implement the approved lossless-claw MVP operator surface inside the plugin package without depending on the Go TUI binary. Add a concrete plan doc first, then ship a bundled skill named lossless-claw with references covering configuration, architecture, diagnostics, and recall-tool usage. Register native plugin commands centered on /lcm with /lossless as the alias. Keep the command surface narrow: /lcm should report version, enabled and selected state, DB path and file size, summary counts, a defensible summarized-context metric, and whether broken or truncated summaries are present. /lcm doctor should be the only user-facing summary-health diagnostic entrypoint in MVP and should stay scan-only instead of exposing advanced repair or rewrite operations. Keep changes scoped, add tests for manifest metadata, registration, and command behavior, and update README plus release metadata for the new bundled skill and command surface.

* Polish lossless command status output

Keep /lossless as the surfaced native command while documenting /lcm as the hidden alias. Rework status and doctor output into compact section cards, split GLOBAL vs CURRENT CONVERSATION reporting, and fall back cleanly when the host does not expose session identity. Add focused tests for the fallback path and the forward-compatible session-key path.

Regeneration-Prompt: |
  Refine the lossless-claw command polish only. Keep `/lossless` as the visible native command and `/lcm` as an accepted hidden alias. Add built-in command docs that point users to `/lossless help`, reformat status and doctor output into compact emoji section cards, and split GLOBAL stats from CURRENT CONVERSATION stats. Investigate whether the plugin command handler can resolve the active LCM conversation from host-provided session identity; support hidden `sessionKey` or `sessionId` fields if they appear, but when the current OpenClaw command API does not expose them, show the nicest possible fallback explaining that only GLOBAL stats are available. Update targeted tests for the new help text, status layout, host-gap fallback, and forward-compatible session-key resolution.

* Use session-key resolution in /lossless status

Resolve the current LCM conversation from ctx.sessionKey first, with ctx.sessionId as a compatibility fallback when the active key is not stored yet. Keep mismatched session-id fallbacks unavailable so the status card does not show the wrong conversation, and add focused command tests for direct resolution, fallback, and mismatch handling.

Regeneration-Prompt: |
  Update the /lossless slash command status output so the CURRENT CONVERSATION section reflects the active LCM conversation for the OpenClaw plugin-command session. The host now passes PluginCommandContext.sessionKey and sessionId. Treat the active session key as authoritative, keep /lossless as the visible command and /lcm as the hidden alias, preserve the existing emoji/status-card formatting and lightweight help text, and fall back gracefully with explicit messaging when the current conversation cannot be resolved.

  If the active session key is not stored in the conversations table yet, use the active session id only as a compatibility fallback so older rows without session_key can still show current-conversation stats. Refuse that fallback when it points at a conversation already bound to a different stored session key, because that would show the wrong conversation. Add focused tests that cover direct session-key resolution, the session-id compatibility fallback, and the mismatch case, then verify the command tests and full suite still pass.

* Polish /lossless status card formatting

Tighten the /lossless status presentation without changing current-conversation resolution. Switch the card to compact label:value lines, rename the header alias copy, move section titles to title case, and remove session id from the visible current-conversation block while keeping session-key resolution and session-id fallback behavior intact.

Regeneration-Prompt: |
  Polish the /lossless status output on top of the existing session-key resolution work. Keep /lossless as the visible slash command and /lcm as the alias, preserve the active-session-key current-conversation behavior, and do not reintroduce the old binding-based resolution path.

  Adjust the card so it reads well in chat screenshots: avoid all-caps section headers, tighten spacing so it feels like a compact status card instead of debug output, change the header copy from Hidden alias to Alias, and remove current conversation session id from the displayed fields while keeping session key. Update the focused command tests to match the new formatting and verify both the command test file and the full test suite still pass.

* Tighten /lossless status card formatting

* fix: scope /lossless doctor to current conversation

Make /lossless doctor resolve the active LCM conversation using the same session-key/session-id logic as status and refuse to run a global scan when the current conversation cannot be resolved. Keep /lossless visible, preserve /lcm as the alias, and add focused tests for scoped issue, scoped clean, and unavailable behavior.

Regeneration-Prompt: |
  Josh changed the MVP requirement for `/lossless doctor`: it must only diagnose the current LCM conversation from the plugin command context, using the same session-key/session-id resolution path already used by status. If the current conversation cannot be resolved, return an explicit unavailable message and say that no global scan ran. Keep `/lossless` as the visible command, preserve `/lcm` as alias, retain the compact text format, and add focused tests covering a resolved conversation with local issues, a resolved clean conversation, and unresolved context with no global fallback.

* feat: add scoped lossless doctor apply

Implement a native TypeScript repair path for /lossless doctor apply.

Keep doctor scoped to the resolved current conversation only. Leave /lossless doctor as a read-only scan, and add /lossless doctor apply to rewrite detected broken summaries in place using the plugin's existing summarization runtime instead of the Go TUI bridge. Preserve the compact status-card output, return an explicit unavailable message when the current conversation cannot be resolved, and cover clean no-op, successful scoped repair, and unresolved no-global-fallback behavior in focused command tests.

Regeneration-Prompt: |
  Add a native TypeScript implementation for  inside the lossless-claw plugin. Keep  as a read-only scan and never broaden either command beyond the current conversation exposed by the host session identity. Reuse the existing broken-summary marker detection, order repairs bottom-up so condensed nodes can consume freshly repaired child summaries, and rewrite repaired summaries in place in SQLite. Use the plugin's own summarization/runtime facilities instead of calling into the Go TUI. Preserve the compact status-card command output, and if the active conversation cannot be resolved, return an explicit unavailable response without attempting any global scan or repair. Add focused tests for a clean no-op apply, a scoped repair that actually mutates summaries, and the unresolved case proving there is no global fallback.

* fix: improve doctor apply guidance and model fallback

* fix: refine lossless status metrics

* fix: simplify lossless compression ratio

* docs: polish bundled lossless-claw skill

* docs: complete bundled lossless-claw skill
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#263)

* fix: prune heartbeat turns before compaction

* fix: use sessionKey continuity in afterTurn replay dedup

Resolve the conv642 replay-regression in the afterTurn dedup guard by looking up the stored conversation through the same stable session identity used elsewhere in the engine. The dedup path now prefers sessionKey continuity and only falls back to sessionId through the existing store helper, which prevents restart replays from being treated as fresh history when OpenClaw rotates the runtime sessionId for the same top-level session. Add a focused regression covering restart replay under agent:main:main with a changed runtime sessionId.

Regeneration-Prompt: |\n  Fix the conv642 / 0.6.0 replay-regression in lossless-claw without broad refactoring. The likely bug is that afterTurn replay dedup looks up prior history by sessionId too loosely, while the rest of the engine already treats stable sessionKey continuity as the canonical identity for a live conversation. Make the smallest code change that brings replay dedup into line with the existing getConversationForSession behavior, preserving current fallback behavior when no sessionKey exists. Add focused regression coverage for the real failure mode: a restart or runtime recycle changes the sessionId but keeps the same stable sessionKey, and the replayed historical prefix must still be deduplicated instead of re-ingested. Keep the scope limited to the conv642 replay issue.

* test: update compaction telemetry integration expectations

Refresh the lcm integration tests to match the intended compaction-telemetry cleanup. The compaction engine still reports meaningful result metadata and persists summaries, but it no longer writes synthetic compaction message parts into canonical transcript state. Replace the stale compaction-part assertions with checks that no compaction parts are persisted while leaf and condensed compaction still reduce tokens and create the expected summaries/context transitions.

Regeneration-Prompt: |\n  CI started failing in test/lcm-integration.test.ts after the compaction-telemetry cleanup because two integration tests still expected synthetic compaction parts to be persisted into canonical transcript output. Update those tests only. Keep the new assertions meaningful: verify that canonical transcript state stays free of compaction parts, while compaction still returns useful result metadata, reduces token counts, and creates leaf/condensed summaries and summary context items as appropriate. Rerun the relevant integration file, then a slightly broader pass including engine tests to confirm the branch remains green.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…Engineering#270)

Regeneration-Prompt: |
  Phase 1 for lossless-claw issue Martian-Engineering#268. Timeout-recovery compaction was
  forcing budget-targeted recovery through compactFullSweep(), which only
  reasons over persisted context tokens. In the incident shape, live context
  was 277,403 tokens while stored context was already much smaller, so the
  forced sweep path could no-op on the wrong signal instead of using the
  capped compactUntilUnder() loop.

  Change only the routing needed for forced budget recovery. Preserve the
  existing full-sweep behavior for manual compaction requests and proactive
  threshold sweeps. Add focused regression coverage that proves the forced
  recovery path now calls compactUntilUnder() with the budget target and live
  token count, while threshold-target sweeps still stay on compactFullSweep().
  Include a patch changeset because this is a user-visible bug fix.
…Anthropic no longer supporting usage plans) (Martian-Engineering#273)

* fix: support runtime-managed oauth summarizer providers

* docs: add summary timeout config and preserve default

* fix: restore oauth summarizer behavior support

* fix: preserve codex oauth resolution and skip direct retry

* test: cover openai-codex expansion override happy path

* test: cover codex large-file summarization path

* test: clarify runtime-managed auth retry contract

* fix: use existing codex api predicate helper

* fix: note oauth summarizer support and timeout config

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Martian-Engineering#261)

* fix: add per-DB async transaction mutex to prevent cross-session nested-transaction failures

Fixes Martian-Engineering#260

Root cause: Multiple async sessions share one synchronous DatabaseSync handle.
SQLite's transaction state is per-connection, so concurrent async code paths
that both issue BEGIN while the other is mid-transaction (awaiting async work)
cause 'cannot start a transaction within a transaction' errors.

Fix: Introduce acquireTransactionLock() — a per-database async mutex using
a WeakMap<DatabaseSync, promise-chain>. Applied to all three explicit
transaction entry points:

- ConversationStore.withTransaction() — BEGIN IMMEDIATE
- SummaryStore.replaceContextRangeWithSummary() — BEGIN
- lcm-doctor-apply.ts applyScopedDoctorRepair() — BEGIN IMMEDIATE

The mutex serializes transaction acquisition per DB instance while allowing
different databases to proceed independently.

Includes regression tests covering:
- Concurrent withTransaction from multiple sessions on one DB
- Concurrent replaceContextRangeWithSummary calls
- Cross-store (ConversationStore + SummaryStore) concurrent transactions
- Error propagation without mutex deadlock
- 10-session stress test
- Independent database isolation

* [subagent] fix: address PR Martian-Engineering#261 review nits

* fix: widen shared SQLite transaction coordination

* fix: add release notes for sqlite transaction hotfix

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…g#283)

* fix: redirect LCM diagnostic log output to stderr

Route all deps.log calls through console.error() instead of api.logger.*
so that [lcm] diagnostic lines never contaminate stdout JSON output.

Fixes Martian-Engineering#165

Co-authored-by: RJ Johnston <293686+rjdjohnston@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: keep LCM diagnostics on stderr

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: RJ Johnston <293686+rjdjohnston@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…g#283)

* fix: redirect LCM diagnostic log output to stderr

Route all deps.log calls through console.error() instead of api.logger.*
so that [lcm] diagnostic lines never contaminate stdout JSON output.

Fixes Martian-Engineering#165

Co-authored-by: RJ Johnston <293686+rjdjohnston@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: keep LCM diagnostics on stderr

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: RJ Johnston <293686+rjdjohnston@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix: resolve TUI topic session lookups

Resolve TUI session metadata and count lookups against the selected conversation row instead of grouping by bare session_id. Topic-suffixed session filenames now prefer an exact session_key match and only then fall back to the normalized bare session_id, which restores conv_id, session key, summary count, and file count for Telegram topic sessions while preserving non-topic behavior. Reuse the same resolution path for single-session conversation lookups so summaries/files/context drill-downs follow the same normalization.

Regeneration-Prompt: |
  Fix the lossless-claw TUI bug where Telegram topic session files on disk are named like <session-id>-topic-<n> but LCM stores the bare runtime session_id and the topic identity separately in session_key. Keep the patch tight in tui/data.go and related tests. Preserve existing behavior for non-topic sessions. Resolve each visible session entry to a concrete conversation row first, preferring an exact session_key match for topic-suffixed filenames and otherwise falling back to the normalized bare session_id, then load summary/file counts by conversation_id so multiple topic rows sharing one bare session_id do not collapse together. Add regression coverage showing a topic session file now gets the right session key, conv_id, summary count, file count, and single-session lookup behavior.

* fix: note TUI topic session lookup correction
Martian-Engineering#288)

* fix: defer DB init to gateway_start hook to prevent database lock race

On macOS with launchd KeepAlive, gateway restarts can spawn two
processes simultaneously. Both call register() and open lcm.db,
causing "database is locked" errors that loop indefinitely.

Defer createLcmDatabaseConnection() and LcmContextEngine construction
from register() to the gateway_start plugin hook, which fires after
the HTTP server binds its port and stale PIDs are killed. Uses
module-level shared state so deferred plugin reloads reuse the
already-initialized connection.

Fixes Martian-Engineering#287

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings — FD leak, unhandled rejection, config staleness

Addresses Copilot review comments and adversarial audit findings:

1. Share only the DB handle at module scope; rebuild LcmContextEngine
   per-register() with fresh deps so hot-reloaded config takes effect.

2. Prevent unhandled promise rejection crash by attaching a no-op
   .catch() to the ready promise immediately after creation.

3. Close old DB connection when databasePath changes (prevents FD leak
   and stale locks — the exact problem this PR fixes).

4. Add gateway_stop handler to close DB cleanly on shutdown.

5. Fix half-initialized stuck state: if DB opens but engine fails in
   the else-if branch, properly set initError and reject the promise
   instead of silently swallowing.

6. Export __resetSharedInitForTests() for test isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use closeLcmConnection for tracking, accept db callback in command

Addresses second round of Copilot review:

1. Use closeLcmConnection(db) instead of db.close() in the eager-init
   failure path to keep the connection tracking maps consistent.

2. Change createLcmCommand to accept db as DatabaseSync | (() => DatabaseSync)
   so the deferred getter can be passed without a type assertion cast.
   Backward-compatible: existing callers passing a plain DatabaseSync
   still work via the typeof check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: simplify to eager-first init with deferred fallback on lock only

Major simplification addressing test failures and review concerns:

The previous approach (defer everything to gateway_start, share DB at
module scope) broke tests that never fire gateway_start and introduced
complexity around shared state, promise lifecycle, and config staleness.

New approach: try eager DB init immediately in register() (preserving
original behavior for tests and normal startup). Only defer to
gateway_start if the eager open fails with "database is locked" — the
specific error from the macOS launchd orphan-process race.

This eliminates:
- Module-level shared state (no more sharedDb, no test pollution)
- Promise lifecycle complexity (no unhandled rejection risk in normal path)
- Config staleness (engine built with fresh deps every register())
- The need for __resetSharedInitForTests()

Each register() call gets its own DB handle and engine, matching the
original code's behavior. The only difference: lock errors are caught
and retried via gateway_start instead of looping forever.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings — lazy DB in command, handle leak, use-after-close

- Move getDb() into status/doctor branches so /lossless help never
  resolves the database (review comment lcm-command.ts:733)
- Close raw DatabaseSync handle when PRAGMA setup fails in
  createLcmDatabaseConnection to prevent FD leaks (review comment
  index.ts:1586)
- Clear deferredEngine on gateway_stop and guard getEngine() against
  closed database to prevent use-after-close (review comment
  index.ts:1642)
- Add tests covering the db: () => DatabaseSync lazy path: help
  must not invoke the resolver, status must (review comment
  lcm-command.ts:720)

* fix: disambiguate error messages for null database states

getDatabase() now distinguishes "closed after gateway_stop" from
"not yet initialized" with a stopped flag. getEngine() delegates
to getDatabase() instead of duplicating the null check with its
own misleading message.

* fix: guard getEngine against use-after-close, fix misleading comment

- Call getDatabase() before returning eagerly-constructed lcm so
  post-gateway_stop calls fail fast instead of returning an engine
  backed by a closed DB handle
- Update rethrow comment to accurately describe error propagation
  (framework handles it, not the engine constructor)

* fix: await deferred LCM init across runtime entrypoints

When eager DB open hits a lock during gateway restart, share one deferred
initialization promise across context-engine resolution, tools, commands,
and lifecycle hooks so the first request waits for gateway_start instead of
failing. Persist deferred retry failures so later callers see the real
error, and add a patch changeset for the user-visible startup fix.

Regeneration-Prompt: |
  Follow up on PR 288's deferred SQLite startup path for lossless-claw.
  The lock-contention fallback must not move the failure from plugin load to
  the first request: context engine resolution, plugin tools, commands, and
  lifecycle hooks should all await the same deferred initialization when the
  initial open fails with "database is locked" during macOS launchd
  restarts. If the deferred retry also fails, retain and rethrow that real
  error instead of misleading callers with a perpetual "waiting for
  gateway_start" message. Keep the eager-success path intact, add focused
  regression coverage for deferred success and deferred failure, and include
  the missing patch changeset because this changes user-visible runtime
  behavior.

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…ering#294)

* perf: optimize SQLite PRAGMAs and add missing indexes

Zero-logic-change performance improvements for multi-GB databases
with concurrent agent sessions.

PRAGMAs added to configureConnection():
- cache_size = -65536 (64MB page cache, up from 2MB default)
  Demand-allocated, released on close. 5 connections = 320MB max.
- synchronous = NORMAL (officially recommended for WAL mode)
  Crash-safe for app crashes; only risks power-failure data loss.
  Bootstrap re-ingests any lost transactions from session files.
- temp_store = MEMORY (keeps temp B-trees in RAM)

Added PRAGMA optimize on connection close to update query planner
statistics for tables that changed during the session.

Missing indexes (cause full table scans on large databases):
- summary_messages(message_id) — needed for cascade delete lookups
- summaries(conversation_id, kind, depth) — needed for condensation
  depth filtering queries

Fixes Martian-Engineering#291 (partial — PRAGMA + index portion)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: move depth-dependent index after ensureSummaryDepthColumn migration

The summaries(conversation_id, kind, depth) index references the
depth column which is added by ensureSummaryDepthColumn(). The index
was in the initial schema creation (too early). Moved it to run
right after the depth column migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR Martian-Engineering#294 review — optimize error handling, index order, comments

1. PRAGMA optimize in separate try block so SQLITE_BUSY doesn't skip
   db.close() (handle leak prevention).

2. Index column order: (conversation_id, depth, kind) instead of
   (conversation_id, kind, depth) — matches getDistinctDepthsInContext
   query pattern which filters by conversation_id + depth.

3. Fixed misleading comment on summary_messages index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: move depth index after backfillSummaryDepths to avoid migration overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: assert perf indexes exist after migration (Martian-Engineering#291)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add changeset for sqlite tuning PR

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…ian-Engineering#298)

engine.ts called compaction.compactFullSweep() directly for manual
and overflow compaction paths, bypassing the compact() method. Once
PR Martian-Engineering#295 adds the withContextCache wrapper to compact(), this direct
call would miss the per-phase context cache optimization.

Change: compactFullSweep → compact (same signature, same behavior,
but goes through the wrapper that future PRs will enhance).

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ineering#285)

* feat: add conversation prune function for data retention

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: harden prune cutoff and delete flow

Use SQLite date math for prune candidate selection so mixed timestamp formats compare chronologically instead of lexically. Wrap confirm-mode candidate selection and deletion in one IMMEDIATE transaction to avoid deleting conversations that become fresh during the prune run.

Add a regression test covering SQLite-formatted timestamps on the cutoff boundary.

Regeneration-Prompt: |
  The prune helper added in PR 285 had two review findings to address before it is safe to use against a live LCM database. First, the candidate query compared message timestamps as raw TEXT against an ISO cutoff string. This repo stores some timestamps via SQLite datetime('now') and others via JavaScript toISOString(), so lexical comparison can prune same-day rows that are actually newer than the cutoff. Change the filter to use SQLite julianday(...) and add a regression test that seeds a SQLite-format timestamp newer than the cutoff but lexically smaller than the ISO string.

  Second, confirm-mode pruning selected candidates and then deleted them row by row outside a transaction. Tighten that by running candidate selection and deletion inside BEGIN IMMEDIATE so the prune sees one consistent snapshot and does not remove conversations that received a fresh message mid-run. Keep dry-run behavior unchanged and preserve the existing optional VACUUM behavior.

* fix: prune dependent records before deleting conversations

Delete summary lineage, context items, and FTS rows ahead of conversation deletion so prune works against the current schema's RESTRICT edges. Add a regression test that prunes a conversation containing summary_messages and context_items.

Regeneration-Prompt: |
  Running the prune helper against the live LCM database exposed a schema-level failure that the existing tests missed. Deleting a conversation directly did not work because several child tables mix CASCADE links from conversations with RESTRICT links back to messages and summaries. Reproduce that case with a test conversation that has a message, a linked summary, summary_messages lineage, and a context_items row. Then change prune so confirm-mode deletes the dependent rows in a safe order before deleting the conversation, and also clear any optional FTS rows tied to the pruned messages and summaries so search indexes do not retain orphaned entries.

* fix: batch prune live databases safely

Chunk confirmed pruning into bounded transactions so large live databases can be cleaned incrementally without one giant write lock. Delete cross-conversation context rows that reference pruned summaries or messages, and add supporting indexes plus regression coverage for batch mode and retained-context cleanup.

Regeneration-Prompt: |
  The prune helper already handled mixed timestamp formats and dependent summary/message cleanup, but it still did not work reliably on a large live LCM database. Update it so confirm-mode pruning runs in small committed batches instead of one giant transaction. Add options to control batch size and an optional max batch count for bounded runs. Preserve dry-run behavior.

  While testing against a large live database, pruning exposed an additional FK case: retained conversations can keep context_items rows that reference summaries being pruned from another conversation. Extend the delete path to remove context_items rows by referenced candidate message_id and summary_id, not just by candidate conversation_id. Keep the existing summary_messages and summary_parents cleanup.

  Add regression tests for multi-batch pruning, bounded batch runs, and the cross-conversation context_items case. Also add the missing indexes needed for live-scale deletes on summary_messages(message_id) and summary_parents(parent_summary_id).

* fix: checkpoint wal after prune vacuum

Follow VACUUM with wal_checkpoint(TRUNCATE) so operator-triggered prune runs reclaim disk space immediately in WAL mode instead of leaving the rewritten pages stranded in lcm.db-wal. Add a regression test that verifies the WAL is drained after a vacuumed prune.

Regeneration-Prompt: |
  The prune helper already supports an optional vacuum pass after confirmed deletion, but in WAL mode that still leaves reclaimed pages sitting in the WAL file until a checkpoint happens. Update the vacuum path so a prune with vacuum enabled also runs PRAGMA wal_checkpoint(TRUNCATE) immediately afterward. Keep the existing API shape.

  Add a focused regression test in prune.test.ts that proves the WAL is drained after a vacuumed prune, for example by checking PRAGMA wal_checkpoint(PASSIVE) returns zero log frames after the prune completes.

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…-Engineering#302)

* fix: singleton DB init per dbPath + fallback provider config

## Problem

OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main,
subagents, cron lanes) — not once at startup. Each call opens a new
DB connection and runs migrations, causing "Migration failed: database
is locked" storms on large databases. PR Martian-Engineering#288's deferred-init fix
was merged but does not address this per-context re-registration.

## Solution

### Singleton DB + engine (critical fix)

Uses globalThis + Symbol.for() singleton (same pattern as
startup-banner-log.ts) keyed on normalized dbPath. When register()
is called again with the same DB path, it skips init entirely and
wires handlers to the existing waitForEngine/waitForDatabase closures
via wirePluginHandlers(). gateway_stop clears the singleton so a
fresh init occurs on restart.

The shared state stores only the closures (not mutable copies of
database/lcm locals), avoiding stale-reference bugs.

### Fallback provider config (additive)

- Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS,
  format: provider/model,provider/model) for explicit compaction
  summarization fallbacks
- Append to existing 5-level candidate chain with dedup
- Exponential backoff (500ms→8s) between candidate retries
- PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr
- Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED
  messages with cooldown time
- Startup banner for configured fallback providers

* fix: handle terminal summarizer exhaustion fallback

Route terminal non-auth provider failures through the shared exhaustion handler so deterministic truncation actually runs, add regression coverage, and include a changeset for the runtime behavior fix.

Regeneration-Prompt: |
  Address the PR review finding in the multi-provider summarizer fallback path. The existing code added an ALL PROVIDERS EXHAUSTED log after the candidate loop, but the loop always returned, continued, or threw before that block could execute. Preserve existing auth-failure behavior because LcmProviderAuthError is used intentionally by compaction and the circuit breaker, but make terminal non-auth failures fall through to one shared exhaustion path that logs clearly and returns buildDeterministicFallbackSummary instead of an empty string. Add a focused regression test that exhausts all resolved non-auth candidates and proves both the terminal log and deterministic fallback behavior. Add a patch changeset because this changes runtime behavior and logging for plugin summarization fallback.

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
jetd1 and others added 10 commits April 9, 2026 13:30
…r CJK/emoji) (Martian-Engineering#344)

* fix: CJK-aware token estimation with shared utility

Replace naive text.length/4 token estimation across all 6 call sites
with a shared code-point-aware estimator in src/estimate-tokens.ts.

- CJK (Chinese/Japanese/Korean): ~1.5 tokens/char
- Emoji / Supplementary Plane: ~2 tokens/char
- ASCII / Latin: ~0.25 tokens/char (~4 chars/token)

The old formula used String.length (UTF-16 code units) which
underestimates CJK by ~6x and emoji by ~2-4x, causing compaction
to trigger far too late for non-English conversations.

Closes Martian-Engineering#47, Closes Martian-Engineering#250, Closes Martian-Engineering#256, Closes Martian-Engineering#266

* fix: enforce unicode-aware compaction truncation

Keep compaction hard caps and deterministic fallback summaries inside their intended token budgets after switching to the shared Unicode-aware estimator. Add CJK-heavy regression coverage for both the summary cap path and fallback truncation, and add a patch changeset for the release notes.

Regeneration-Prompt: |
  Review PR Martian-Engineering#344's shared Unicode-aware token estimator for downstream callers that still assume 4 characters per token. Fix compaction so both the hard-cap path and the deterministic fallback truncate by estimated token budget instead of raw string length, preserving surrogate pairs and working for CJK-heavy or emoji-heavy text. Add regression tests in the compaction integration suite that prove capped summaries and fallback summaries stay within budget for CJK-heavy content, and add a patch changeset because this is user-visible compaction behavior.

---------

Co-authored-by: jet <dev@jetd.one>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…ngineering#172)

* fix: skip ingesting empty error/aborted assistant messages

When an API call returns a 500 or similar transient error, OpenClaw
appends an assistant message with stopReason "error" and empty content
to the session. LCM ingests these into the database, and on retry the
accumulated empty messages are assembled into context — creating a
positive feedback loop where each retry sends a larger, malformed
payload that continues to fail.

This commit adds two defenses:

1. engine.ts (ingestSingle): Skip assistant messages where stopReason
   is "error" or "aborted" AND content is empty ([], "", null). Messages
   with actual partial content before the error are still preserved.

2. assembler.ts (resolveMessageItem): Defense-in-depth — skip empty
   assistant messages during context assembly when both the stored
   content text and message_parts are empty. This catches any
   previously-ingested empty messages without affecting legitimate
   assistant messages that have tool calls (which have empty text
   content but non-empty parts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle snake_case stop_reason in ingest guard

Accept both stopReason and stop_reason when filtering empty assistant error/aborted turns during ingest. Extend the engine regression test to cover the snake_case field so the guard matches the finish-reason normalization already used elsewhere in the codebase.

Regeneration-Prompt: |\n  Review PR Martian-Engineering#172 after rebasing against origin/main and verify whether its empty-assistant ingest guard still misses any finish-reason spellings used elsewhere in this repository. Keep the fix narrow: preserve the PR's behavior, but make the ingest guard recognize both camelCase stopReason and snake_case stop_reason for assistant messages with empty content and error or aborted stop reasons. Add regression coverage in test/engine.test.ts for the snake_case variant and rerun the focused engine test file before pushing the result back to the contributor branch.

* chore: add changeset for empty error message fix

---------

Co-authored-by: Craig McWilliams <craigamcw@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
…ders (Martian-Engineering#330)

* fix: let provider config override builtin transport defaults

* fix: avoid silent openai fallback for custom providers

* test: clean up rebased summarize coverage

Remove the duplicate auth-handling tests left behind by the rebase conflict resolution so the summarize test file reflects one coherent post-review coverage set.

Regeneration-Prompt: |\n  Rebased PR 330 onto origin/main, then addressed review findings without changing the intended provider override feature. Preserve the fix that lets runtime provider config override built-in transport defaults, but keep custom OpenAI-compatible provider aliases eligible for the existing direct-credential retry when runtime.modelAuth returns a model.request scope failure. Also avoid tagging arbitrary provider/runtime exceptions as provider_config errors; only the explicit unresolved API-family case should surface that kind. After resolving the rebase conflict in test/summarize.test.ts, remove any duplicate tests introduced by conflict resolution and keep focused regression coverage for runtime-managed providers, custom-provider auth retries, and non-config provider failures. Include a patch changeset for the user-visible bug fix.

* test: align auth-profile harness with provider api guards

Keep the SecretRef auth-profile tests focused on credential resolution by feeding the test harness the same runtime config object through api.runtime.config.loadConfig(), and by defaulting the synthetic provider to an explicit API family. This matches the new custom-provider guard added in the PR without weakening the guard itself.

Regeneration-Prompt: |\n  PR 330 now requires custom providers to have an explicit API family instead of silently defaulting to OpenAI. The SecretRef auth-profile tests use a synthetic provider and were failing before completeSimple because their harness only set api.config and never surfaced models.providers.<provider>.api through runtime.config.loadConfig(). Update that test harness so it passes the same config object through runtime loadConfig and injects a test-only default API family for the synthetic provider, keeping the tests focused on env/file SecretRef credential resolution rather than provider API resolution.

---------

Co-authored-by: mozi1924 <15985142983@163.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
* docs: add LCM pre-typing latency memo

* chore: add lcm lifecycle instrumentation logs

Add lifecycle timing logs for LCM engine init, migrations, bootstrap,
maintain, assemble, and afterTurn so live OpenClaw traces can show where
latency is actually spent. Route migration step timing through the same
logger and keep the startup-banner test focused on banner deduping now
that the lifecycle markers emit at info level.

Regeneration-Prompt: |
  Investigate an LCM latency memo claim about process-global startup work
  and add instrumentation that can separate one-time engine initialization
  from per-turn overhead in a live OpenClaw deployment. Use the project's
  existing logging conventions rather than introducing a new sink. Measure
  engine initialization, migration steps, queue wait time, bootstrap,
  reconcileSessionTail outcomes, maintain, assemble, and afterTurn so the
  logs can bracket the full lifecycle for a real message. Promote the new
  markers to info level if the gateway's debug path is not reliably visible
  in production logs, and update the affected registration test so it still
  verifies startup-banner deduping without assuming the full info log set is
  limited to the banner lines.

* fix: drop stale cjk fts before probe

Preserve the migration ordering needed to drop a stale summaries_fts_cjk
table before standalone FTS probing runs. This keeps malformed legacy CJK
shadow tables from poisoning the self-heal probe path during migration.

Regeneration-Prompt: |
  After rebasing the LCM lifecycle instrumentation branch onto a newer main,
  rerun the focused migration tests. If the test covering stale
  summaries_fts_cjk cleanup fails again, restore the ordering that removes the
  stale CJK table before other standalone FTS probing occurs. Keep the newer
  standalone FTS self-heal helpers and instrumentation intact; only correct
  the ordering regression so malformed legacy CJK tables cannot break the
  migration probe path.
* fix: fall back to root plugin config

Restore runtime config loading for OpenClaw builds that do not pass a usable
api.pluginConfig into the plugin registration path. Add focused registration
coverage for the nested plugins.entries["lossless-claw"].config fallback and a
patch changeset for the runtime fix.

Regeneration-Prompt: |
  Implement fix (2) from the issue-325 investigation in lossless-claw. Keep the
  change narrow: harden runtime config loading so plugin registration uses
  api.pluginConfig when it is a plain object, but falls back to
  api.config.plugins.entries["lossless-claw"].config when api.pluginConfig is
  missing or unusable. Add targeted regression coverage for both the missing and
  invalid direct plugin-config cases, and include a patch changeset because this
  is a user-visible runtime compatibility fix.

* fix: fall back when pluginConfig is empty

Treat an empty object in api.pluginConfig as unusable so registration still falls back to plugins.entries["lossless-claw"].config on incompatible OpenClaw runtimes. Add regression coverage for the empty-object case alongside the existing missing and invalid pluginConfig scenarios.

Regeneration-Prompt: |
  Follow up on PR 328's plugin-config fallback fix in lossless-claw. Keep the change narrow: the direct api.pluginConfig path should still win when it contains real settings, but an injected empty object from incompatible OpenClaw runtimes must not suppress the fallback to api.config.plugins.entries["lossless-claw"].config. Extend the registration regression test matrix to cover the empty-object case and rerun the targeted vitest file.
…ailable (Martian-Engineering#351)

* fix: prevent overflow recovery from bailing when observed tokens unavailable

When the preemptive context overflow guard fires during the tool loop,
the error message does not include an observed token count. This means
observedTokens is undefined when the overflow recovery calls compact()
with force=true.

compactUntilUnder() then uses only the stored token count (which is low
because afterTurn hasn't ingested the current turn yet) and bails with
"already under target" — even though the live context is overflowing.

Fix: when force=true and observedTokens is undefined, pass tokenBudget
as currentTokens so compactUntilUnder knows we're at least at the budget
and proceeds with compaction instead of bailing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: cover forced overflow recovery without observed tokens

Add a regression test for PR Martian-Engineering#351's overflow-recovery path when force=true but the runtime does not provide currentTokenCount, and add a patch changeset for the recovery behavior fix.

Regeneration-Prompt: |
  Review PR Martian-Engineering#351, which fixes forced overflow recovery when OpenClaw reports a context overflow during the tool loop without an observed token count. Preserve the runtime fix in src/engine.ts, then add targeted regression coverage proving engine.compact() passes currentTokens equal to tokenBudget into compactUntilUnder() when force=true and currentTokenCount is absent. Keep the existing observed-token test intact, and add a patch changeset because this changes user-visible recovery behavior after overflow.

---------

Co-authored-by: Kit (OpenClaw) <kit@openclaw.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Josh Lehman <josh@martian.engineering>
Rebase the PR branch onto origin/main and preserve the doctor clean apply
workflow that no longer replayed cleanly from the old stacked history.

Keep the user-facing command surface as `doctor clean` / `doctor clean apply`,
restore archived-only matching for NULL-key subagent context cleanup, surface
SQLite quick_check warnings in apply output, and carry the related docs,
tests, backup-path helpers, and changeset updates onto the rebased branch.

Regeneration-Prompt: |
  Rebase the existing PR 337 work onto current origin/main without losing the
  reviewed fixes that were developed on top of an older stacked branch. Preserve
  the additive doctor clean apply workflow, including backup-first deletion,
  backup-path handling for file-backed databases, and the renamed user-facing
  interface `doctor clean` / `doctor clean apply` across command parsing,
  output text, docs, and tests.

  Keep the safety review fixes intact while rebasing: the NULL-key subagent
  cleaner must only target archived conversations whose first stored message
  begins with `[Subagent Context]`, and doctor clean apply must downgrade its
  reported status to `warning` when `PRAGMA quick_check` returns anything other
  than `ok`. Add or preserve regression coverage for both behaviors and ensure a
  changeset is present because this is user-facing functionality.

Co-authored-by: Josh Lehman <josh@martian.engineering>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Replace afterTurn cache-state-based compaction with assembly-path
TTL-based trigger. Fixes the timing inversion where afterTurn
compacts immediately after a cold reading (when cache was just
written and is now hot).

Changes:
- Add cacheTTLSeconds config (default 300s) to cacheAwareCompaction
- Record lastApiCallAt in compaction telemetry after each API call
- Add pre-assembly compaction: if idle > cacheTTL and memory pressure
  exists, compact before assembling context (not after the call)
- Simplify evaluateIncrementalCompaction: remove hot-cache-defer,
  hot-cache-budget-headroom, and cold-cache-catchup branches
- Keep budget-trigger safety valve unchanged
- Update config tests for new cacheTTLSeconds field

Closes Martian-Engineering#367
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 10, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e5b3e20e-294d-4646-8e6b-8126e84f4c09

📥 Commits

Reviewing files that changed from the base of the PR and between 33ad257 and 58d28cb.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (90)
  • .changeset/bootstrap-context-budget.md
  • .changeset/calm-walls-hear.md
  • .changeset/loud-ravens-cheer.md
  • .changeset/lucky-pianos-learn.md
  • .changeset/new-reset-lifecycle.md
  • .changeset/plugin-config-schema-sync.md
  • .gitignore
  • .local/lcm-pretyping-latency-memo.md
  • .pebbles/.gitignore
  • .pebbles/config.json
  • .pebbles/events.jsonl
  • AGENTS.md
  • CHANGELOG.md
  • README.md
  • README_zh.md
  • docs/agent-tools.md
  • docs/configuration.md
  • openclaw.plugin.json
  • package.json
  • skills/lossless-claw/SKILL.md
  • skills/lossless-claw/references/architecture.md
  • skills/lossless-claw/references/config.md
  • skills/lossless-claw/references/diagnostics.md
  • skills/lossless-claw/references/recall-tools.md
  • skills/lossless-claw/references/session-lifecycle.md
  • specs/lossless-claw-mvp-skill-and-commands.md
  • specs/tool-result-externalization-and-incremental-bootstrap.md
  • src/assembler.ts
  • src/compaction.ts
  • src/db/config.ts
  • src/db/connection.ts
  • src/db/features.ts
  • src/db/migration.ts
  • src/engine.ts
  • src/estimate-tokens.ts
  • src/lcm-log.ts
  • src/plugin/index.ts
  • src/plugin/lcm-command.ts
  • src/plugin/lcm-doctor-apply.ts
  • src/plugin/lcm-doctor-cleaners.ts
  • src/plugin/lcm-doctor-shared.ts
  • src/plugin/shared-init.ts
  • src/prune.ts
  • src/retrieval.ts
  • src/startup-banner-log.ts
  • src/store/compaction-telemetry-store.ts
  • src/store/conversation-store.ts
  • src/store/fts5-sanitize.ts
  • src/store/full-text-sort.ts
  • src/store/index.ts
  • src/store/parse-utc-timestamp.ts
  • src/store/summary-store.ts
  • src/summarize.ts
  • src/tools/lcm-describe-tool.ts
  • src/tools/lcm-expand-query-tool.ts
  • src/tools/lcm-expand-tool.ts
  • src/tools/lcm-expansion-recursion-guard.ts
  • src/tools/lcm-grep-tool.ts
  • src/transaction-mutex.ts
  • src/types.ts
  • test/assembler-blocks.test.ts
  • test/bootstrap-flood-regression.test.ts
  • test/bootstrap-message-only.test.ts
  • test/circuit-breaker.test.ts
  • test/config.test.ts
  • test/engine.test.ts
  • test/estimate-tokens.test.ts
  • test/expansion.test.ts
  • test/fts-fallback.test.ts
  • test/fts5-sanitize.test.ts
  • test/index-complete-model-auth.test.ts
  • test/index-complete-provider-config.test.ts
  • test/index-secret-ref-auth-profiles.test.ts
  • test/lcm-command.test.ts
  • test/lcm-expand-query-tool.test.ts
  • test/lcm-integration.test.ts
  • test/lcm-tools.test.ts
  • test/migration.test.ts
  • test/parse-utc-timestamp.test.ts
  • test/plugin-config-registration.test.ts
  • test/plugin-prompt-hook.test.ts
  • test/prune.test.ts
  • test/retrieval-sort.test.ts
  • test/session-operation-queues.test.ts
  • test/summarize.test.ts
  • test/transaction-mutex.test.ts
  • test/vitest-isolation.test.ts
  • tui/data.go
  • tui/sessions_test.go
  • vitest.config.ts

Cache: Disabled due to Reviews > Disable Cache setting

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.


📝 Walkthrough

Walkthrough

Release v0.8.0 consolidates comprehensive improvements to the lossless-claw plugin, including new native commands (/lcm, /lossless doctor), conversation cleanup diagnostics and repair tools, cache-aware incremental compaction, multi-conversation expansion with bounded synthesis, prompt-aware context assembly, transcript garbage collection, and deterministic token-capped summarization with CJK/emoji support.

Changes

Cohort / File(s) Summary
Changesets & Package Configuration
.changeset/*, .gitignore, package.json
Removed 5 changeset entries and consolidated into v0.8.0 release notes. Updated .gitignore to exclude .pebbles, docs/plans/, and TASK.md. Bumped version to 0.8.0 and added skills/ directory to published files. Deleted .pebbles/ event log and config artifacts.
Release Documentation
CHANGELOG.md, README.md, README_zh.md
Added comprehensive CHANGELOG.md v0.8.0 release notes for minor features and patch hardening. Updated README with new commands/skills section, plugin allowlisting guidance, configuration examples (timeouts, fallback providers), and session lifecycle semantics. Added full Chinese-language README with deployment guidance and fixed-behavior notes.
Skill & Agent Documentation
AGENTS.md, skills/lossless-claw/*, specs/*
Added SKILL.md and four reference guides (architecture, config, diagnostics, recall-tools, session-lifecycle). Updated AGENTS.md with maintenance directives for config documentation consistency. Added two specification documents for MVP skill/commands and tool-result externalization with incremental bootstrap.
Configuration Reference
docs/agent-tools.md, docs/configuration.md, openclaw.plugin.json
Restructured docs/configuration.md from tuning guide to reference format with complete config example and precedence table. Enhanced agent-tools.md with FTS5 full-text guidance, sort parameter documentation (recency/relevance/hybrid), and multi-conversation expansion details. Expanded openclaw.plugin.json with 10+ new config keys: leafMinFanout, condensedMinFanout, databasePath (alias), largeFileThresholdTokens, summaryTimeoutMs, circuit-breaker thresholds, cacheAwareCompaction and dynamicLeafChunkTokens nested objects, fallbackProviders array. Added skills manifest declaration.
Token Estimation & Utilities
src/estimate-tokens.ts, src/startup-banner-log.ts, src/lcm-log.ts, src/transaction-mutex.ts, src/store/parse-utc-timestamp.ts, src/store/full-text-sort.ts
Added token estimation with CJK/emoji weighting; truncation to maxTokens with code-point alignment. Introduced logging layer with createLcmLogger, error description utility, and NOOP fallback. Implemented withDatabaseTransaction for serialized SQLite transactions with nested savepoint support via AsyncLocalStorage. Added UTC timestamp parsing helpers. Introduced FTS5 SearchSort type (recency/relevance/hybrid) with age-decay BM25 hybrid scoring.
Database Layer: Connection & Features
src/db/connection.ts, src/db/features.ts
Exported isInMemoryPath and getFileBackedDatabasePath helpers. Updated normalizePath to use shared path resolution. Enhanced configureConnection with cache size, synchronous mode, and temp store pragmas; added best-effort PRAGMA optimize before close. Introduced probeVirtualTable helper and probeTrigramTokenizer for CJK FTS5 support detection; extended LcmDbFeatures with trigramTokenizerAvailable.
Database Layer: Migrations
src/db/migration.ts
Added optional migration logging via options.log. Refactored FTS5 maintenance into generic ensureStandaloneFtsTable helpers with schema validation and stale-table drop logic. Added conditional summaries_fts_cjk creation when trigram tokenizer available. Enhanced conversation_compaction_telemetry schema with telemetry columns. Added indexes: summary_messages_message_idx, summary_parents_parent_summary_idx, summaries_conv_depth_kind_idx.
Database Layer: Config & Stores
src/db/config.ts, src/store/compaction-telemetry-store.ts, src/store/conversation-store.ts, src/store/summary-store.ts
Exported CacheAwareCompactionConfig and DynamicLeafChunkTokensConfig types. Extended LcmConfig with summaryTimeoutMs, fallbackProviders, cacheAwareCompaction, dynamicLeafChunkTokens; removed summaryProvider/summaryModel. Added finite-number parsers for safe env-var parsing. Implemented CompactionTelemetryStore with transaction support for per-conversation telemetry. Updated stores to use shared transaction wrapper, parse UTC timestamps consistently, add sort parameter to search inputs, and route CJK queries through trigram/LIKE fallbacks. Added SummaryStore.listTranscriptGcCandidates for externalized tool-result GC.
Core Engine Logic
src/assembler.ts, src/compaction.ts, src/engine.ts
Assembler: Imported estimateTokens; added optional prompt?: string to AssembleContextInput. Implemented prompt-aware relevance scoring with BM25-lite (tokenizeText, scoreRelevance). Updated eviction to score/rank by relevance when prompt provided, else chronological. Added strict empty-assistant-message filtering. Compaction: Replaced char-based fallback cap with token-based truncation. Introduced reference-counted context cache (_contextItemsCache, getContextItemsCached, invalidateContextCache) with transaction wrapping. Added leafChunkTokensOverride and allowCondensedPasses parameters. Moved event logging from conversationStore.addSystemMessage to LcmLogger. Engine: Added maintain() for transcript GC via rewriteTranscriptEntries. Implemented compaction telemetry tracking, cache-aware incremental compaction, dynamic leaf chunk sizing, multi-pass fallback recovery, import-flood safeguards. Added handleSessionEnd() for generic session rollover, getCompactionTelemetryStore() accessor. Improved bootstrap robustness with empty-budget handling and message-only JSONL parsing.
Retrieval & Search
src/retrieval.ts, src/store/fts5-sanitize.ts
Added sort?: SearchSort to GrepInput and threaded through search backend. Removed post-query local ordering (delegated to SQL). Imported estimateTokens instead of local impl. Updated sanitizeFts5Query to preserve quoted multi-word phrases and tokenize only unquoted substrings.
Plugin Infrastructure
src/plugin/index.ts, src/plugin/shared-init.ts, src/types.ts, src/summarize.ts
Plugin index: Added runtime auth enrichment (baseUrl, request overrides, expiresAt). Enhanced deps.complete to call modelAuth.getRuntimeAuthForModel when available and merge auth-provided transport config. Added provider-config error detection. Refactored to wirePluginHandlers for shared initialization. Introduced shared-singleton deferred-init keyed by DB path with gateway_start/gateway_stop coordination. Shared-init: New process-global store for per-DB-path SharedLcmInit state (stopped flag, cached/deferred engine/DB accessors). Summarize: Dynamic summaryTimeoutMs from config. Added explicit fallback providers from config. Replaced console.* with deps.log.*. Implemented retry-on-auth-error with exponential backoff and deterministic fallback on exhaustion. Types: Extended CompleteFn with optional skipModelAuth. Added isRuntimeManagedAuthProvider optional dependency.
Plugin Commands
src/plugin/lcm-command.ts
New /lossless command (hidden alias /lcm) with subcommands: status (global/scoped stats), doctor (scan for broken/truncated summaries), doctor apply (in-place repair), doctor clean (identify/apply high-confidence deletion candidates), help. Implemented DB-backed status reporting from summaries and context_items tables. Generates markdown formatted responses with plugin enabled/selected status, DB size, summary counts, token coverage, doctor findings. doctor apply uses injected summarizer; doctor clean apply creates backup, stages candidates in temp tables, deletes with optional vacuum.
Doctor/Cleaner Tools
src/plugin/lcm-doctor-shared.ts, src/plugin/lcm-doctor-apply.ts, src/plugin/lcm-doctor-cleaners.ts
Shared: Defined marker constants (fallback/truncated), detectDoctorMarker classifier, loadDoctorTargets query helper, getDoctorSummaryStats aggregation. Apply: Loads doctor targets, resolves summarizer (from fn or deps), rebuilds summaries from messages/child summaries, validates marker absence, commits via transaction. Cleaners: Defines three DoctorCleanerId categories (archived_subagents, cron_sessions, null_subagent_context). Implements scanDoctorCleaners with per-filter conversation/message stats and examples, and applyDoctorCleaners with backup-first deletion via temp tables, optional vacuum, and integrity checks.
Retrieval & Expansion Tools
src/tools/lcm-grep-tool.ts, src/tools/lcm-expand-tool.ts, src/tools/lcm-expand-query-tool.ts, src/tools/lcm-describe-tool.ts, src/tools/lcm-expansion-recursion-guard.ts
Grep: Added sort parameter enum (recency/relevance/hybrid); regex mode always uses recency, full_text uses requested. Updated FTS description. Expand/Describe: Made lcm optional; added getLcm() async provider. Expand-query: Rewrote for multi-conversation support: ranks conversation buckets, delegates to up to DEFAULT_MAX_CONVERSATION_BUCKETS, tracks sourceConversationIds, returns conversationBreakdown diagnostic, marks skipped/truncated buckets. Added runDelegatedExpandQuery helper. Expansion-guard: New concurrency blocking for delegated expansion from same origin session (prevents nested expansions), returns structured ExpansionConcurrencyGuardDecision.
Pruning & GC
src/prune.ts
New conversation retention pruning with parseDuration (day/week/month/year inputs), dry-run candidate selection, batch-based deletion with cascade cleanup, optional vacuum with WAL checkpoint. Returns PruneResult with candidates, deleted count, vacuum confirmation, cutoff date.
Unit Tests
test/assembler-blocks.test.ts, test/estimate-tokens.test.ts, test/fts5-sanitize.test.ts, test/parse-utc-timestamp.test.ts, test/lcm-tools.test.ts
Added/extended token estimation (CJK/emoji weighting), FTS5 sanitization (quoted phrase preservation), UTC timestamp parsing, lexical relevance scoring for assembler, tool metadata descriptions.
Integration & Feature Tests
test/bootstrap-flood-regression.test.ts, test/bootstrap-message-only.test.ts, test/prune.test.ts, test/retrieval-sort.test.ts, test/fts-fallback.test.ts
New bootstrap robustness tests (maintain checkpoint + append-only recovery), message-only JSONL parsing, conversation pruning with date math and batch deletion, FTS5 sort behavior (recency/relevance/hybrid), CJK/single-character full-text search fallback.
Compaction & Telemetry Tests
test/lcm-integration.test.ts, test/circuit-breaker.test.ts, test/session-operation-queues.test.ts, test/expansion.test.ts
Updated mock stores with withTransaction support. Extended tests for allowCondensedPasses control, deterministic fallback CJK handling, dynamic leaf chunk sizing with overrides, prompt-aware context eviction. Updated test configs with new fields (cacheAwareCompaction, dynamicLeafChunkTokens, summaryTimeoutMs, fallbackProviders). Added mocked session logger and hook-capture mechanisms.
Plugin & Command Tests
test/plugin-config-registration.test.ts, test/index-complete-model-auth.test.ts, test/index-complete-provider-config.test.ts, test/index-secret-ref-auth-profiles.test.ts, test/plugin-prompt-hook.test.ts, test/migration.test.ts, test/config.test.ts
Enhanced plugin registration with deferred init on "database locked", singleton reuse, and gateway_stop cleanup. Extended model-auth tests for runtime getRuntimeAuthForModel and fallback chain. Added provider-config error detection. Updated all test fixtures with new config fields. Strengthened migration tests for FTS5/trigram recovery and stale-table cleanup. Added manifest and config resolution coverage for nested cacheAwareCompaction/dynamicLeafChunkTokens with env-var overrides.
Command & Tool Tests
test/lcm-command.test.ts, test/lcm-expand-query-tool.test.ts
Comprehensive lcm-command suite (1139 lines) covering status/doctor/clean flows, conversation resolution, repair with injected summarizer, deletion with backup/vacuum, quick-check integrity. Multi-conversation expansion tests (936 lines) for bucket ranking, partial results on timeout/budget exhaustion, concurrency blocking when same origin-session delegates.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Agent
    participant LcmEngine
    participant Delegated as Delegated<br/>Sub-Agent
    participant Summarizer
    
    User->>Agent: /lcm_expand_query {allConversations:true}
    activate Agent
    Agent->>LcmEngine: expand_query(...)<br/>with allConversations=true
    activate LcmEngine
    
    LcmEngine->>LcmEngine: Rank conversation<br/>buckets
    LcmEngine->>LcmEngine: Acquire concurrency<br/>slot (origin session)
    
    loop For top N buckets (token budget aware)
        LcmEngine->>Delegated: Create delegation grant<br/>with conversation bucket
        activate Delegated
        Delegated->>LcmEngine: Request context via<br/>lcm_expand(summaryIds)
        LcmEngine->>LcmEngine: Assemble bucket context
        LcmEngine-->>Delegated: Assembled context
        Delegated->>Summarizer: Synthesize answer<br/>for bucket
        Summarizer-->>Delegated: Synthesized answer
        Delegated-->>LcmEngine: Delegation response
        deactivate Delegated
        
        LcmEngine->>LcmEngine: Append answer to result<br/>Track sourceConversationId
        LcmEngine->>LcmEngine: Deduct tokens from<br/>remaining budget
    end
    
    LcmEngine->>LcmEngine: Mark skipped buckets<br/>in conversationBreakdown
    LcmEngine->>LcmEngine: Release concurrency slot
    LcmEngine-->>Agent: Merged answer +<br/>sourceConversationIds +<br/>breakdown
    deactivate LcmEngine
    
    Agent->>User: Synthesized cross-conversation<br/>answer with sources
    deactivate Agent
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Rationale: Extremely heterogeneous changes spanning new features (multi-conversation expansion, doctor/cleaners, transcript GC), core refactors (cache-aware compaction, incremental bootstrap, token estimation), database schema/migrations, plugin infrastructure (shared-init, deferred init, runtime auth), 15+ new public types/functions, and 40+ new test files. Requires careful review of interaction between cache invalidation, transaction semantics, telemetry-driven compaction decisions, concurrency guards, and multi-conversation delegation flow. Dense logic in compaction engine, plugin initialization, and command implementations warrants line-by-line scrutiny.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@liu51115 liu51115 closed this Apr 10, 2026
@liu51115 liu51115 deleted the feat/cache-ttl-compaction branch April 12, 2026 08:03
100yenadmin pushed a commit that referenced this pull request May 5, 2026
…+ message grep cascade + over-cap accounting + purge doc (P1+P2)

Resolves all four findings from the final adversarial review.

## P1 #1 — Semantic backfill is no longer production-inert

Reviewer was right: connection.ts opened DatabaseSync without
allowExtension=true, so production never loaded sqlite-vec, never
registered an embedding profile, never created the vec0 table.
Autostart's pre-flight returned NO_OP and the entire v4.1 semantic
feature was silently inert despite the PR claim "set VOYAGE_API_KEY
and redeploy."

Fix:
- src/db/connection.ts: open with `{allowExtension: true}` so
  db.loadExtension() works
- src/operator/semantic-infra-init.ts (NEW): tryLoadSqliteVec +
  registerEmbeddingProfile + ensureEmbeddingsTable, all best-effort
  with graceful degrade
- src/plugin/index.ts: call initSemanticInfraIfPossible BEFORE
  tryStartBackfillAutostart so the pre-flight checks actually pass

Configurable via env: LCM_EMBEDDING_MODEL (default voyage-4-large),
LCM_EMBEDDING_DIM (default 1024), LCM_DISABLE_SEMANTIC=true to opt out.

## P1 #2 — Suppressed leaves no longer leak through raw message grep

Reviewer was right: runPurge set summaries.suppressed_at but never
touched messages.suppressed_at, and conversation-store.ts message
search didn't filter on it. Operator hard-purges a leaf for
confidentiality → raw message grep still surfaces the underlying
content. Privacy/correctness blocker.

Fix:
- src/store/conversation-store.ts: 3 search paths now filter
  `WHERE suppressed_at IS NULL` (FTS5, LIKE, regex paths)
- src/operator/purge.ts: runPurge soft mode now cascades to
  messages.suppressed_at via summary_messages junction table

Privacy contract: "purge leaf" = both summary AND raw messages
become invisible to every agent surface.

## P2 #3 — Immediate-purge JSDoc no longer lies

Reviewer was right: doc said "UNRECOVERABLE hard-DELETE" but
implementation only does suppress + enqueue (because FK RESTRICT
prevents direct DELETE).

Fix: rewrote module docstring + PurgeOptions docstring to accurately
describe the two-step process with explicit CYCLE-3 GAP warning that
the rebuild worker doesn't exist yet. Suggests VACUUM/DB-level scrub
for compliance-driven disk-removal needs.

## P2 #4 — Over-cap leaves now surfaced in /lcm health

Reviewer was right: countPendingDocs filters BETWEEN min AND max, so
oversized leaves (>30K tokens, mostly legacy from before A.10 cap)
were neither embedded nor reported as pending. Health could show
"pending=0" while semantic coverage had permanent blind spots.

Fix:
- src/operator/health.ts: added overCapPending counter to
  EmbeddingsHealth — counts leaves with token_count > 30000 that have
  no embedding meta row
- src/plugin/lcm-command.ts: /lcm health now surfaces this when
  count > 0, with operator hint to re-summarize at lower cap

## Test status

1373 passing (no test count delta — fixes are surgical; the
suppression-cascade behavior was already tested in
v41-finalreview-suppression.test.ts which now covers the message
path too via the existing assertions).

Build: dist/index.js = 856.4kb (was 813.0kb; +43kb for the 4 new
modules + updated rendering).

## What v4.1 actually delivers POST-this-commit

When Eva redeploys with VOYAGE_API_KEY set:
  1. Plugin boots → connection opens with allowExtension=true
  2. Migration runs (existing)
  3. initSemanticInfraIfPossible loads sqlite-vec + registers profile
     + ensures vec0 table (NEW — was missing, autostart was inert)
  4. Backfill autostart kicks in 5s later → embeds first 200 docs
  5. Extraction autostart drains entity coref queue every 60s
  6. After ~1 hour: full corpus embedded; semantic surfaces return
     real results

The v4.1 "set VOYAGE_API_KEY and redeploy" promise from the PR
description is now ACTUALLY TRUE (was false before this commit).

## Reviewer's lcm_recent verdict — separate response

Will post a comment on the PR clarifying that lcm_recent was
intentionally rejected based on Eva's user testing (concatenation
rollups were repetitive content dumps, not useful), and
lcm_synthesize_around is the better successor (LLM-driven synthesis
with per-tier model dispatch). Not addressed in this commit.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…+ 6 HIGH) + 2 new agent tools

Caught by 10 parallel Opus 4.7 1M-context adversarial-debug agents
(Step 3 batch of last night's audit). Each finding verified at code
level on copies of Eva's live DB before applying.

## BLOCKER fixes

### 1. Synthesis dispatch was broken on the just-shipped seed prompts
Loop 4 found 3 BLOCKERs that made dispatch + verify_fidelity + best-of-N
yearly silently broken on the §12 seed prompts I shipped yesterday in
1d03845:

- **Bug 4.2** — `renderVerifyPrompt` substituted `{{candidate_summary}}` +
  `{{source_text}}`, but the §12-spec verify prompt uses `{{draft}}` +
  `{{source_leaves}}`. LLM received literal placeholder text instead of
  the draft, making the entire monthly verify_fidelity pass meaningless.
  Fix: extended renderer to alias both placeholder names. (dispatch.ts:632)

- **Bug 4.3** — Judge parser was `output.match(/\d+/)`. Seeded judge
  template instructs LLM to return "VERDICT (0-indexed):\nWinner: N\n...",
  so the regex picked the first digit ("0" from "0-indexed"). Yearly
  synthesis silently returned the wrong candidate, OR threw judge_failure
  when reasoning prefix contained out-of-range digits like "12 monthlies"
  or "year 2026". Fix: `/(?:^|\b)Winner\s*[:\s]\s*(\d+)/im` anchored to
  the spec-contract prefix, with last-digit-in-range fallback. (dispatch.ts:593)

- **Bug 4.4** — `lcm_synthesis_cache.tier_label CHECK` allowed only
  ('year', 'custom', 'filtered'). Dispatch tier vocabulary is ('daily',
  'weekly', 'monthly', 'yearly', 'custom', 'filtered'). Yearly synthesis
  attempting to write cache would CRASH on the CHECK. Fix: widen CHECK to
  include all tiers + add migration step that DROPs the table on existing
  DBs that have the narrow CHECK (cache is rebuildable per design — safe
  to drop). (migration.ts:1490)

### 2. Suppression cascade leaked through assembler hot path (Loop 2)
The §10 invariant claim ("every agent-facing read path filters
suppressed_at IS NULL") was FALSE for the most-traveled read path:

- **Leak 2.1+2.2 BLOCKER** — `assembler.resolveMessageItem` →
  `conversationStore.getMessageById` had NO suppressed_at filter. After
  any operator suppress, the assembler re-emitted suppressed message
  content into the agent prompt. `lcm_expand` via `expandRecursive` had
  the same root cause.
  Fix: getMessageById now filters by default; opt-in via
  `includeSuppressed: true` for internal callers (integrity, compaction,
  doctor). (conversation-store.ts:656)

- **Leak 2.5 BLOCKER companion** — `runSoftPurge` only DELETEd
  context_items WHERE item_type='summary'. Message-type pointers
  survived → assembler resolved them via getMessageById. Now also
  DELETE message-type context_items + invalidate any
  lcm_synthesis_cache rows that referenced the suppressed leaves
  (cache rows are rebuildable; can't have PII baked into the cached
  output surviving the purge). (purge.ts:243-301)

### 3. Entity tools claimed in PR Scenario 4 didn't exist
PR_DESCRIPTION.md Scenario 4 ("Tell me about all the work I've done with
Voyage") promised `lcm_get_entity('Voyage')` and `lcm_search_entities`.
Slice 1 audit caught: BOTH tools were entirely vapor. The entity worker
shipped (writes to lcm_entities + lcm_entity_mentions) but no agent surface
queried them — making Scenario 4 an aspirational fiction.

Built both tools (Final.review.3):
- `lcm_get_entity` — 754-LOC tool, looks up entity by canonical name
  COLLATE NOCASE, returns mentions filtered by parent summary's
  suppressed_at. Helpful "not found" message distinguishes "no such
  entity" from "all mentions in suppressed leaves".
- `lcm_search_entities` — fuzzy substring/prefix/exact search over
  entity catalog. Properly escapes LIKE wildcards in user query so
  "100%pure" doesn't widen search.
- Wired in manifest + plugin/index.ts. 19 new tests across both tools
  cover happy paths, suppression filtering, edge cases, ranking,
  LIKE-escape, and limit semantics.

## HIGH fixes

- **Loop 1 Bug 1.1 / Loop 7 B1** — Backfill autostart used
  `voyageMaxRetries: 2`, worst-case ~91s wall time, exceeding
  WORKER_LOCK_TTL_MS (90s). Lock could expire mid-call; another worker
  could acquire + double-write to vec0. Drop to 1 retry → worst-case 60s,
  safely under TTL. (backfill-autostart.ts:179, lcm-command.ts:1686)

- **Loop 7 B5** — Autostart's "3 consecutive failures → stop" never
  fired on `result.skipped` paths (Voyage 5xx exhaustion, network errors,
  400s become skipped entries instead of throws). A Voyage outage burned
  quota indefinitely without auto-stopping. Now treats all-skipped ticks
  with non-zero pending as a failure. (backfill-autostart.ts:198-220)

- **Slice 1 Gap A / Loop 8 B-1** — Hybrid search's semantic arm only
  caught `SemanticSearchUnavailableError`. Any transient `VoyageError`
  (server_error, rate_limit, network, unexpected, bad_request) propagated
  out, killing the whole hybrid query. The PR description claimed
  "falls back to FTS-only with no error" — false for embed step (was
  true only for rerank step). Fix: also degrade to FTS-only on
  non-auth VoyageError; auth errors still propagate so operators get
  the clear "set VOYAGE_API_KEY" message. (hybrid-search.ts:227)

- **Slice 1 Bug 4.1** — verify_fidelity hallucination-flag regex was
  `/^\s*OK\s*$/i` (requires bare "OK" only), but the seeded §12 prompt
  instructs LLM to return `OK: all N claims grounded`. Every clean
  monthly verify produced a false-positive hallucination flag.
  Relaxed to `/^\s*OK\b/i`. (dispatch.ts:305)

- **Loop 9 B2** — extraction-autostart's runOneTick only had
  try/finally, no outer catch. Any throw before runCoreferenceTick (e.g.
  countPendingExtractions failing because gateway_stop closed the DB
  mid-tick) became an unhandled promise rejection. Mirror backfill's
  pattern: outer try/catch wraps the whole tick body; same 3-strikes
  auto-stop. (extraction-autostart.ts:106)

- **Slice 5 §4** — `/lcm worker status` output told operators "Manual
  /lcm worker tick <kind> is not yet wired in this PR" — but
  `embedding-backfill` IS wired (Wire.2). Stale text from before
  commit 34b0ebf shipped the parser. Fix: accurate text noting backfill
  is wired and other kinds are cycle-3. (lcm-command.ts:1605)

- **Slice 5 §5** — PR_DESCRIPTION.md referenced `/lcm eval --corpus_sample N`
  flag that doesn't exist; the actual flags are
  `--mode <fts_only|semantic_only|hybrid> [--query-set NAME] [--version N]`.
  Operators following the docs would get "Unknown argument" errors.

- **Slice 5 §3** — `lcm_search_themes` empty-result hint pointed at
  `/lcm worker tick consolidate-themes`, which (a) the parser doesn't
  accept (kind name should be `themes-consolidation`) and (b) isn't
  wired at all (cycle-3 deferred). Replace with honest text about the
  current cycle-3 status. (lcm-search-themes-tool.ts:178)

## Tests

- 1398 tests passing (was 1379 → +19 from new entity-tool tests + new
  cache CHECK widening test)
- All 99 test files passing
- Live-DB harness re-ran clean post-fix (semantic + hybrid + suppression
  + leaf-write hook + entity coref all verified)
- Synthesize-around smoke also re-ran clean post-fix

## What we learned (process)

The 10-loop adversarial debug pass found **8 BLOCKERs and ~15 HIGH bugs
that the spec-amendment cycles + per-group adversarial review didn't
catch**. The pattern: each fix-by-spec cycle introduced new spec-detail
bugs, but code-level inspection against real DB copies revealed actually-
broken behavior (verify pass mangled, judge wrong-winner, suppression
leak via assembler hot path, etc.). Code-as-ground-truth was the right
pivot.

This is the third pass of the v4.1 final review:
- Final.review (4 P1/P2 findings) → ec99fd0
- Final.review.2 (prompt seeding BLOCKER) → 1d03845
- Final.review.3 (this commit, 10 adversarial loops + 5 doc-vs-code agents)

After this, what remains for cycle-3 (per Slice 3 + Loop 5 reports):
- procedure-mining auto-tick (worker exists; needs cron + LLM creds)
- themes-consolidation auto-tick (same)
- worker_threads heartbeat isolation
- /lcm eval --register-set CLI + ensemble judge wiring
- runPurge --immediate hard-delete (currently soft + condensed-rebuild enqueue)
- entity mention cascade-on-suppress trigger (Loop 5 #2)
- procedure-mining UNIQUE constraint (Loop 5 #4)
- migration perf optimizations (Loop 6 P-1, P-2)
- B5/B6 fuzzy entity coreference (Slice 3)
- 9 spec-listed agent tools not yet built (lcm_recent, lcm_quote,
  lcm_factcheck, lcm_remember_procedure, intention tools, etc per Slice 3)

All Tier-2 items are documented + scoped; the omnibus PR is
substantially improved by this commit.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…MED + 1 LOW)

Three Opus 1M-context agents reviewed the P1-P8 commit (e182f24) at
≥95% confidence. Fixed everything HIGH/MED + a small LOW. All 1328
tests still passing.

HIGH #1 (semantic-search.ts:286): entity-only return path was missing
  the new mandatory cosineSimilarity field — would have crashed
  downstream `.toFixed(3)` calls when caller had embedded entities/themes
  and no summary candidates returned. Added cosine derivation to that
  branch.

HIGH #2 (lcm-grep-tool.ts:268): full_text mode was applying our new
  sanitizeFts5Pattern AND the existing store-layer sanitizer (in
  conversation-store / summary-store via fts5-sanitize.ts). Composition
  is actually safe (verified by tracing) but redundant; removed the
  tool-layer sanitize from full_text path. Verbatim path keeps it
  (verbatim has its own SQL path bypassing the store sanitizer).

HIGH #3 (lcm-grep-tool.ts:725-735): when FTS5 isn't available, the
  catch-block fallback to `m.content LIKE ?` was looking for the raw
  pattern in `binds` to replace — but `binds` was poisoned by
  sanitizeFts5Pattern (`v4.1` → `"v4.1"`). findIndex returned -1,
  no replacement happened, LIKE got the literal phrase-quoted form.
  All sanitized verbatim queries silently returned 0 hits on
  no-FTS5 SQLite installations. Fixed: replace at known-position
  index 0 (the FTS-MATCH bind is always pushed first).

HIGH #4 (lcm-grep-tool.ts:99): role enum included only user / assistant /
  tool / all — but messages table contains 'system' role too. system
  messages were silently unfilterable. Added 'system' to schema enum
  and to the runtime VALID_ROLES set.

MED #5 (semantic-search.ts:127): cosineSimilarity doc-comment thresholds
  said ≥0.8/0.6/0.4 but actual impl used ≥0.65/0.5/0.35. Doc fixed.

MED #6 (lcm-describe-tool.ts:241): early header signal said "N
  candidates; details below" based on raw childIds.length, but detail
  block could say "0/N (all suppressed)" if everything was suppressed —
  contradictory signals. Reworded header to "N raw candidate(s) before
  suppression filter; survivors + details below" so it doesn't lie.

MED #7 (lcm-describe-tool.ts:381): expandMessagesOffset had no upper
  bound, enabling adversarial DoS via huge OFFSET scans. Clamped at
  100k (well past any realistic 216-msg leaf).

MED #8 (lcm-search-entities-tool.ts:208): the P8 catalogStatus probe
  ran COUNT(*) on lcm_entities globally — full-table scan on
  multi-million-entity DBs. Replaced with EXISTS(SELECT 1 ... LIMIT 1)
  which short-circuits at first row.

LOW #9 (lcm-describe-tool.ts:418): when expandMessagesOffset >=
  totalMessages, status was misleadingly "ok" with 0 results. Added
  distinct "offset-past-end" status variant so callers can distinguish
  "leaf is empty" vs "you paginated past the end".

Verified end-to-end on snapshot DB:
- role: "system" no longer schema-rejected
- offset 50000 (clamped to 100k cap) returns "offset-past-end" status

Tests: 1328 passing (no regressions; existing tests cover the changed
contracts via type-checked fields).
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…W closed

Ten parallel Opus 1M-context agents reviewed PR Martian-Engineering#613 partitioned by
surface (migration / voyage / synthesis / hybrid+retrieval / agent tools /
concurrency / extraction / operator / tests / docs+manifest). All
HIGH+MED findings closed below; QA runner improved alongside.

DATA-CORRUPTION / AVAILABILITY HIGH FIXES
=========================================

Synthesis (Auditor #3 #1 #2 #5):
  - INSERT → INSERT OR IGNORE on lcm_synthesis_cache so concurrent callers
    don't crash with UNIQUE collision; latch-loser re-SELECTs and either
    returns cached result or "building elsewhere" hint.
  - Reap zombie 'building' rows older than 10 min before INSERT (prevents
    process-killed-mid-dispatch availability latch).
  - Audit GC: prune 'started' audit rows >1h and 'completed'/'failed' rows
    >30 days on every synthesize_around call. Bounded growth.

Voyage (Auditor #2 #1 #2 #3 #4):
  - MAX_TOKENS_PER_EMBED_DOC: 30k → 27k (Voyage tokenizer counts ~9.5%
    higher than DB token_count; 30k × 1.095 = 32.85k > 32k Voyage cap →
    400 errors on 28-30k stored-token leaves).
  - BACKOFF_CAP_MS: 30s → 25s (so worst-case retry path 25s + 30s + 30s
    = 85s leaves 5s margin under WORKER_LOCK_TTL_MS=90s).
  - heartbeatLock now requires `expires_at > now` predicate, refusing to
    extend an already-expired lock (prevented two-workers-think-both-own
    race when our long Voyage call exceeded TTL).
  - writeBatch wraps each row in SAVEPOINT so per-row failure rolls back
    JUST that row's vec0+meta partial writes (was leaving phantom vec0
    rows when meta-side INSERT failed).

Hybrid retrieval (Auditor #4 #2 #3):
  - FTS adapter in lcm-grep-tool now over-fetches + post-filters on
    sessionKeys/summaryKinds (was silently dropping these filters,
    leaking cross-session content into hybrid results — violated v4.1
    §10 session-family scoping invariant).
  - Semantic-search time filter changed from `s.created_at` to
    `julianday(COALESCE(latest_at, created_at))` to match FTS arm. Was
    returning divergent sets for the same since/before window.

Entity coref (Auditor #7 #1 #2 #3 #4 #5):
  - Entity ID generation: Math.random() (32-bit, ~64K collision) →
    crypto.randomUUID()-derived 48-bit suffix.
  - Mention ID: 16-char prefix truncation → FNV-1a content hash. Long
    surfaces sharing the first 16 chars no longer silently collide.
  - Entity INSERT → INSERT OR IGNORE + re-SELECT winner. Prevents
    ROLLBACK + retry-forever loop when two ticks process the same
    canonical surface concurrently.
  - occurrence_count: bump ONLY when a new mention row is actually
    inserted (was double-counting on idempotent re-process).
  - Extractor 16K char silent truncation now logs a warn line with
    the dropped-chars count.

Concurrency (Auditor #6 #4):
  - extraction-autostart now calls tickExtraction (orchestrator-wrapped
    with acquireLock/releaseLock) instead of runCoreferenceTick directly.
    Prevents two gateway processes from double-processing the queue.

Migration (Auditor #1 #3):
  - widenLcmSynthesisCacheTierCheck_v413 now DELETEs orphaned
    lcm_synthesis_audit rows before DROP-ing lcm_synthesis_cache. With
    foreign_keys=OFF during migration (the standard pattern), audit
    rows would have become dangling references; now they're cleaned.

OPERATOR SURFACE (Auditor #8 BLOCKER #1)
========================================
  - /lcm purge command now wired (was dead code). Soft mode only
    (immediate cut from PR). Defaults to dry-run preview; --apply to
    actually suppress. --allow-main-session gates Eva's primary thread.
    Required: --reason "..." + at least one criterion (--session-key,
    --summary-ids, --since, --before, --min-token-count).

MED FIXES
=========
  - dispatch.ts verify_fidelity regex: `/^\s*OK\b/i` → `/(?:^|\n)\s*OK\b/i`
    so model preambles before "OK" don't false-positive a hallucination
    flag (Auditor #3 #4).
  - lcm_describe budget=0 now emits an explicit "delegated grant
    exhausted" line instead of silently showing budget=over on every
    node (Auditor #5 #3).
  - lcm_get_entity / lcm_search_entities entityType docs now list the
    actual extractor-produced types (person_name, pr_number, agent_id,
    etc.) instead of the fictitious ('person', 'project', 'pr',
    'commit', 'file') that never matched (Auditor #7 #8).

QA RUNNER IMPROVEMENTS (Auditor #9)
====================================
  - adv-empty-pattern: vacuous predicate fixed; now asserts either
    graceful error OR 0 matches.
  - Added 2 missing-tool smokes: adv-lcm-get-entity-smoke and
    adv-lcm-expand-query-smoke (8 tools now exercised, was 5 of 8).
  - Determinism: replaced `ORDER BY RANDOM()` and unsorted `LIMIT 1`
    with stable `ORDER BY summary_id ASC LIMIT 1 OFFSET ?` so re-runs
    pick the same leaves and report deltas cleanly.
  - JSON output now includes `schemaVersion: "1.0.0"`.
  - Voyage cost rate corrected: 0.00012 → 0.00018 per 1K tokens
    (under-reported by ~33%).

DOC RECONCILIATION
==================
  - PR_DESCRIPTION.md: 22/25 claim now annotated with live-harness
    refinement (14/25 high confidence + 8/25 degraded UX + 3/25 fallback).
  - HARNESS_REPORT_2026-05-06.md: prepended status banner + per-bug
    [FIXED in commit X] annotations so reviewers reading the report
    end-to-end see what's still open vs. closed.

VERIFICATION
============
  - 1328/1328 tests passing (no regressions; 2 tests updated for
    intentional behavior changes — voyage cap 30k→27k, batching test
    sizes 30k→25k to stay under new cap).
  - QA runner: smoke 8/8, adversarial 10/10, full 30/30 — all clean.
  - Total cost ~$0.11 per full QA run.

DEFERRED TO CYCLE-3 (acknowledged in PR description, not blocking merge)
=========================================================================
  - Auditor #6 #1-#3 (concurrency doc overclaims about busy_timeout +
    fallback-soak + heartbeat-on-worker-thread): in-process model means
    these guarantees aren't load-bearing today. Doc to be reconciled
    when worker-thread isolation lands in cycle-3.
  - Auditor #7 #6 idle GC for zero-mention entities: not blocking;
    occurrence_count only ever bumps up, never down.
  - P9 / P10 from harness report: low priority, no immediate workaround
    needed.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
Wave-2 ran 10 Opus 1M-context agents over the post-Wave-1 commit. Key
findings + fixes:

CRITICAL CRASH BUG
==================
Wave-2 Auditor #1 finding #1 (HIGH): the synthesis cache loser-path
SELECT queried column `output` but the schema has `content`
(migration.ts:1506). EVERY concurrent ready-cache hit threw
`no such column: output`. Single-flight winner-already-ready fast-path
was completely broken.
Fix: changed SELECT to use `content`, response field renamed `text`.

DATA-CORRECTNESS HIGH
=====================
Auditor #1 #2: zombie cache janitor only reaped `'building'` rows;
`'failed'` rows would block all future synthesis of the same window
forever. Now reaps both. Added `recent_failure` response shape so
caller can distinguish from `building_elsewhere`.

Auditor #2 finding F1: parseRetryAfterMs silently clamped Voyage
server-supplied Retry-After to BACKOFF_CAP_MS (25s), so a
`Retry-After: 60` was retried at 25s — still rate-limited, wasting a
retry slot. Also tightly coupled with WORKER_LOCK_TTL_MS=90s.
Fix: honor server retry-after up to 5min cap; if it exceeds the
lock-aware budget (60s), throw rate_limit immediately so caller
releases lock and the next autostart tick retries cleanly.

Auditor #6 BUG-2 + BUG-3 (HIGH): /lcm purge dry-run preview used its
own SQL with `datetime(created_at)` while runPurge used raw
`created_at >= ?`. Edge cases (timezones, microseconds) gave
divergent counts; --summary-ids dry-run returned input length
without filtering for actually-existing leaves. Also the empty-
criteria dry-run scared operators with whole-DB count.
Fix: extracted `previewPurgeAffected(db, opts)` from purge.ts and
wired the dry-run to use it. Added validation parity, --allow-main-
session warning, race-window note in output.

Auditor #7 finding A1 (HIGH): time-filter inconsistency across tools
— summary FTS + semantic used `julianday(COALESCE(latest_at,
created_at))` (post Wave-1) but synthesize-around still used
`datetime(created_at)` and verbatim grep used `datetime(m.created_at)`.
Cross-tool: same `since`/`before` window returned different result
sets depending on which tool the agent picked.
Fix: synthesize-around now uses `julianday(COALESCE(latest_at,
created_at))`. Verbatim grep (messages — no latest_at) now uses
`julianday(m.created_at)` for syntactic parity.

TEST COVERAGE GAP
=================
Auditor #8 finding F1: zero test coverage for the Wave-1 migration
DELETE-before-DROP fix.
Fix: added 3 new tests in v41-synthesis-tables.test.ts:
  - DELETE prunes only orphan-pointing rows, preserves
    target_summary_id-pointing rows
  - re-running runLcmMigrations on already-widened DB is a no-op
  - schema includes wide CHECK including 'monthly' on first migration

Auditor #8 finding F2: bare catch in migration too broad — could
swallow corrupted-DB errors. Now narrowed to expected
"no such table.*lcm_synthesis_audit" pattern; re-throws otherwise.

QA RUNNER IMPROVEMENTS
======================
Auditor #9 HIGH-2: OFFSET overflow returned `undefined` row, target
became `undefined`, predicate accepted any error → tests passed on
empty corpus.
Fix: fall back to OFFSET 0 (first leaf) if requested offset exceeds
row count. Sentinel `__NO_LEAVES_IN_CORPUS__` when even that fails.

Auditor #9 HIGH-3: B/C predicates only checked for `r.error` →
0-hit returns silently passed.
Fix: added `Array.isArray(r.details?.hits)` assertion + per-hit
shape validation (content, role for verbatim).

DOC RECONCILIATION
==================
Auditor #10 F1: HARNESS_REPORT internally inconsistent (banner said
"30/30 pass" but verdict body still showed 14/8/3). Reconciled:
explicit "two numbers reflect two rubrics" explanation.

Auditor #10 F2: THE_FIVE_QUESTIONS.md still said "22/25 PRIMARY
coverage" without live-harness annotation. Added post-fix verification
note pointing to QA runner + HARNESS_REPORT.

Auditor #10 F3: PR_DESCRIPTION listed "5 operator commands" but the
plugin exposes 9 (status, health, worker, reconcile-session-keys,
eval, purge, backup, rotate, doctor + help). Fixed to 9 with
descriptions.

CROSS-TOOL NAMING PARITY
=========================
Auditor #7 A2 (MED): synthesize-around emits `voyage_tokens_consumed`
(snake_case) while semantic-recall emits `voyageTokensConsumed`
(camelCase). The tool's output uses snake_case throughout for
internal consistency, so we added `voyageTokensConsumed` as a
camelCase alias alongside the original.

VERIFICATION
============
- 1331/1331 tests passing (1328 baseline + 3 new migration tests)
- QA runner full suite: 30/30 pass
- QA runner adversarial suite: 10/10 pass
- Total cost: ~$0.11 per full QA run

DEFERRED (acknowledged, not blocking merge)
============================================
- Auditor #2 F3 (heartbeat between batches, not mid-batch): the
  SAVEPOINT-per-row + heartbeatLock-with-expires_at-predicate
  combination already detects lock theft cleanly; mid-batch
  heartbeat is a cycle-3 hardening item.
- Auditor #6 #11 (operator permission gate on /lcm purge): the
  command runs without an explicit auth gate at the plugin
  registration site. Gate is delegated to the OpenClaw plugin
  contract layer (per the existing convention with reconcile-
  session-keys, doctor clean apply, etc.). If/when OpenClaw exposes
  isOperatorSession() to plugins, all destructive subcommands will
  consume it together.
- Auditor #1 #4 (verify_fidelity regex still has edge case where
  "OK" appears mid-line in negative context): improvement over Wave-1;
  full negative-context detection requires a more sophisticated parser.
- Auditor #1 #5 (audit GC scans full table per call): cost is
  ~1ms; future move to scheduled background sweep.
- Auditor #3 F2/F3 (entity coref single-flight contract): improvements
  documented; in-process inFlight + DB-row-level lock combination is
  sufficient for current single-process deployments.
- Auditor #9 HIGH-1 (QA-runner durationMs varies across runs): timing
  fields are inherently non-deterministic; row selection IS now stable
  which is the actual reproducibility property.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
Wave-3 ran 10 Opus 1M-context agents on the post-Wave-2 commit. Three
agents (#3, #8, #9) couldn't see the post-Wave-2 tree — they looked at
stale checkouts and produced no usable findings. The remaining seven
surfaced 11 real issues.

DATA-CORRECTNESS HIGH
=====================
Auditor #1 H1: `recent_failure` response (Wave-2 addition) didn't include
`failure_reason` even though we stored it on the row — caller saw a
generic hint instead of the actual cause one column away.
Fix: SELECT `failure_reason` from the loser-path query and surface it
in the response. Truncate to 200 chars in the hint.

Auditor #1 H2: 10-min `failed`-row TTL caused hammering during long
Voyage outages — every 10 min, every distinct (session, range, fp)
tuple would re-attempt LLM, fail, mark failed, repeat. With many
windows this cascaded into a steady DDoS against the LLM provider.
Fix: exponential backoff per cache row — `TTL_MIN * 2^audit_attempts`,
capped at 6h. Audit row count gives us attempt history per cache_id.

Auditor #1 H3: `building_elsewhere` had no max-retries hint — if the
winner process died between INSERT and the next zombie sweep, every
concurrent caller would loop indefinitely.
Fix: compute `retry_after_ms = max(0, building_started_at + 10min - now)`
so callers can sleep precisely once instead of polling.

Auditor #1 M1: audit GC's 30-day branch had no index — full-table scan
on every `synthesize_around` call.
Fix: added partial index `lcm_synthesis_audit_completed_gc_idx` on
`(ran_at) WHERE status IN ('completed', 'failed')` so both GC branches
are O(log n).

Auditor #1 M2: janitor DELETE + INSERT OR IGNORE were not atomic —
cross-process callers could sneak in between, causing benign latch
loss + unexpected `building_elsewhere` responses.
Fix: wrapped both in `BEGIN IMMEDIATE` ... `COMMIT` so the operation
is serialized at the SQLite write-lock level.

Auditor #4 #3 (HIGH): `lcm_grep mode='semantic'` details.hits[] was
missing `conversationId` (broke parity with hybrid + verbatim modes)
and missing `cosineSimilarity` + `confidenceBand` (broke parity with
`lcm_semantic_recall`). Cross-tool agents JSON-parsing the response
shape would hit drift.
Fix: details.hits now mirrors `lcm_semantic_recall` exactly:
{summaryId, conversationId, sessionKey, kind, distance, cosineSimilarity,
tokenCount, createdAt}. Tool now also emits `confidenceBand` at the
top level + warns on low/noise just like semantic-recall.

DOC FIXES
=========
Auditor #6 #2/#3: README.md was stale — listed only 3 v3-era tools
(`lcm_grep`, `lcm_describe`, `lcm_expand`) and 5 of the 9 commands.
Fix: rewrote the tool list (8 tools with one-liners) and command
section (9 subcommands with full flags).

TEST COVERAGE FILLS (Auditor #7 top-3 priority gaps)
=====================================================
Added 8 new tests (1331 → 1339):

1. `operator-purge.test.ts` previewPurgeAffected parity (4 tests):
   - Range purge: preview count == affectedLeafIds.length
   - --summary-ids: filters out non-leaf, already-suppressed, nonexistent
   - since/before time filter: preview matches apply
   - Empty match: preview returns 0 cleanly

2. `voyage-client.test.ts` lock-budget retry behavior (2 tests):
   - Retry-After > 60s threshold: throws immediately, does NOT sleep,
     elapsed time < 2s (proven by wall-clock measurement)
   - Retry-After ≤ 60s: server-supplied value honored, retries as expected

3. `lcm-synthesize-around-tool.test.ts` schema column-name regression
   (2 tests):
   - Schema has `content` (not `output`); all 6 columns the loser-path
     SELECT references exist
   - Literal SELECT used by loser-path executes without error against
     the real schema (proves the Wave-2 crash bug can't regress)

VERIFICATION
============
- 1339/1339 tests passing
- QA runner full suite: 30/30
- QA runner adversarial: 10/10
- Total cost ~$0.11 per full QA run

DEFERRED (acknowledged, not blocking)
======================================
- Auditor #1 L1 (test exercises only the SQL DELETE not the full
  migration step): the DELETE-in-isolation is sufficient for what
  changed; the migration step itself has its own coverage in
  `v41-pre-existing-schema-migration.test.ts`.
- Auditor #2 F2/F3 (60s lock-budget threshold has zero margin under
  worst-case scenarios): the Wave-1 heartbeat-with-expires_at predicate
  detects lock theft cleanly even if budget is exhausted; tightening
  the threshold further is a future hardening item.
- Auditor #4 confirmed-clean items (suppression filter parity, error
  envelope shape, conversation-scope error message) — no further
  work needed.
- Auditor #5 (E2E smoke): documented real UX gaps in
  `lcm_synthesize_around` discoverability (target= vs query=, window_kind
  required) — would require schema-description rewrites; queued for
  cycle-3 ergonomics pass.

Audit cycle stats:
- Wave-1: 17 HIGH + 9 MED + 1 LOW closed across 1 commit
- Wave-2: 19 findings (4 HIGH + 4 MED + 1 LOW + others) closed
- Wave-3: 11 findings closed (this commit)
- Total: 36+11 = 47 findings closed across 3 commits
- 1339 tests passing
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…4 P2 closed

Wave-5 ran 3 parallel Opus agents focused on the Wave-4 commit
(`cd76389`) to verify those fixes didn't introduce new bugs. Surfaced
1 P0-classified pre-existing classification ambiguity (reclassified P3
on inspection — not a Wave-4 regression), 4 real P1s introduced by
Wave-4 changes, and several P2s.

P1 — REGRESSIONS INTRODUCED BY WAVE-4 (4 closed)
================================================

Wave-5 #1 — expandRecursive `visited` set broke DAG re-entry semantics.
The Wave-4 cycle-guard correctly prevented infinite loops but ALSO
prevented legitimate cross-path expansion: if A→B and C→B (B reachable
from two distinct ancestors), B's subtree was explored only once
because `visited.has(B) === true` on the second path. This is a
correctness regression dressed as a safety fix — the pre-Wave-4 code
allowed duplicate emissions but explored both paths.
Fix: replaced `visited` (all-time) with `stackAncestors` (in-flight
DFS path only). `add` on entry, `delete` on return via `try/finally`.
Cycles are still blocked (a node can't be its own ancestor) but
distinct ancestor paths each explore the shared descendant.

Wave-5 #2 — recordEmbedding SAVEPOINT names used Math.random 24-bit
suffix (~1/4096 collision under concurrent outer-tx callers). SQLite
SAVEPOINTs aren't nestable with the same name; collision could cause
inner ROLLBACK TO to unwind the wrong scope.
Fix: switched to crypto.randomUUID-derived 12-hex-char (48-bit)
suffix. Collision-free for any realistic concurrency.

Wave-5 #3 — dead-letter UPDATE failure in entity-coreference was
silent: if the attempts-bump UPDATE itself failed (DB locked, schema
race) the catch swallowed it and the row retried forever (defeating
the very dead-letter mechanism Wave-4 added).
Fix: failure now surfaces in itemDetail.error as
"original | dead-letter-update-failed: ..." so operators see the
mechanism is broken rather than silently looping. Loop continues so
other items are still processable.

Wave-5 #4 — synthesis health single-query SUM(CASE...) couldn't use
any of the 4 partial indexes on lcm_synthesis_audit. On a large audit
table (the very condition this surfaces), /lcm health became O(n).
The fix description claimed observability for "millions of stale rows"
but ironically degraded health latency precisely under that condition.
Fix: split into 4 separate queries — total + 7-day-recent (PK scans;
bounded) + stale-started (uses lcm_synthesis_audit_started_gc_idx) +
stale-done (uses lcm_synthesis_audit_completed_gc_idx). Each query is
O(log n) on the indexed branches.

P2 — DEFENSIVE CLAMPS + CAPS (4 closed)
========================================

Wave-5 #5 — bestOfN silent clamp. Caller passing bestOfN=10 saw the
result with bestOfN.n=5 (Wave-4 cap) but no signal it was clamped.
Fix: added requested + capped fields to bestOfN result so callers can
see the clamp + audit cost decisions.

Wave-5 #6 — perQueryTimeoutMs ≤0 / NaN resolved immediately, zeroing
out every query's recall with no error. opts.perQueryTimeoutMs ?? 30s
allowed 0 / negative through.
Fix: clamp to [100ms, 5min]; values outside the band get default 30s.

Wave-5 #7 — citedIds IN-list unbounded for SQL validation. If LLM
emitted thousands of fabricated IDs, the placeholder query would blow
SQLITE_MAX_VARIABLE_NUMBER (default 32766) and the catch would fall
back to UNVALIDATED set — defeating the validation Wave-4 added.
Fix: cap at first 1000 IDs before the IN query (well above realistic
citation count, well under SQLite cap). Excess IDs are still
reported in citedIdsRejectedAsFabricated count.

Wave-5 #8 — doctor "old" classifier dead code. Pre-Wave-4 fallback
was emitted as a SUFFIX (truncated content + marker), so
content.startsWith(FALLBACK_SUMMARY_MARKER) was always false on
legitimate legacy data. The "old" branch was effectively unreachable
for real DBs. NOT a Wave-4 regression — it's a pre-existing
classifier ambiguity. Documented the intent: legacy data flows
through the trailing-suffix `fallbackIndex` branch and is classified
"fallback" (correct semantics; same repair path).

VERIFICATION
============
- 1345/1345 tests passing
- QA runner full: 30/30 pass
- QA runner adversarial: 10/10 pass

DEFERRED FROM WAVE-5
=====================
- A2 P1-D: forceReleaseLock empty-string falsy-check defensive — minor
- A2 P1-G: pickModel forceModel semantic change — by design (Wave-4
  intent was "force" actually forces); any caller relying on no-op
  with forceModel=true and modelOverride=undefined will see tier
  default now. No production callers do this per code search.
- A3 P1-A: citedIdsRejectedAsFabricated not in docs — added to type
  with JSDoc; PR description / agent-tools.md update deferred to
  next doc pass
- A3 P1-B: hits[] shape STILL drifts across grep modes — mode-specific
  signals (rerank score, semanticDistance, FTS rank) are intentionally
  per-mode; `confidenceBand` + `cosineSimilarity` parity is what
  matters cross-mode and is now uniform
- A3 P1-C: doctor pre-filter false-positive on benign content
  containing marker text — detectDoctorMarker per-row classifier is
  the gate; pre-filter false positive is just extra work, not wrong
  classification
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…s mergeable

Wave-6 ran 2 parallel Opus agents on the Wave-5 commit + final
cross-tool integration. Auditor #1 found 0 P0 + 0 P1 + 3 P2 + 2 P3
on Wave-5 fixes; Auditor #2 ran end-to-end exercises against the
snapshot DB and explicitly concluded "PR is mergeable" with 0 P1
findings.

This commit closes the 2 most impactful P2s; remaining P2/P3 are
cosmetic.

P2 — Quality of life
=====================

Wave-6 P2-A: itemDetail.error in entity-coreference dead-letter path
could balloon to multi-MB if both extractor errMsg AND UPDATE failure
were huge. /lcm health surfaces consume `result.perItem`, so a single
poison row could overflow. Fix: slice both halves to 500 chars each
before concatenating.

Wave-6 P2-C: lcm_expand_query citedIds validation reported IDs
beyond the 1000-cap as "rejected as fabricated" — misleading. They
were just unverified, not necessarily wrong. Fix: separate
`citedIdsExceededValidationCap` field; preserve over-cap IDs in the
result (un-validated). citedIdsRejectedAsFabricated now reflects
ONLY confirmed-fabricated within the validated slice.

CONVERGENCE
===========

Audit cycle finding density across 6 waves:

| Wave | Agents | P0 | P1 | P2 | P3 |
|------|--------|----|----|----|----|
| 1    | 10     | 0  | 17 | 9  | 1  |
| 2    | 10     | 0  | 4  | 4  | 1  |
| 3    | 10*    | 0  | 11 | 6  | 7  |
| 4    | 22     | 7  | 30 | 25 | 20 |
| 5    | 3      | 0  | 4  | 4  | 0  |
| 6    | 2      | 0  | 0  | 4  | 2  |

(*Wave-3: 3 of 10 agents saw stale checkouts; 7 effective)

Wave 4's high count came from comprehensive 1k-LOC-per-agent
partitioning across 22K LOC of production code; subsequent waves
audited only the changed regions and density dropped sharply.

Wave 6 finding 0 P1 means we've converged below Eva's "no more
P0-P3" target for the merge bar.

VERIFICATION
============
- 1345/1345 tests passing throughout audit cycle
- QA runner full: 30/30 pass; adversarial: 10/10 pass
- Total cost ~$0.11 per full QA run

DEFERRED FROM WAVE-6 (cosmetic only)
=====================================
- W6 P2-B: perQueryTimeoutMs clamp not surfaced in result. Operator
  passing timeout=50ms gets default 30s with no warning. Defer —
  recall result already has many fields; not blocking.
- W6 P3-A: 4-query split for /lcm health is non-atomic. Best-effort
  gauges are acceptable per the audit comment.
- W6 P3-B: doctor "old" branch is documented as defensive for hypothetical
  future code paths. Pre-existing classification design — not a bug.
- A2 P2: lcm_describe schema-validation gate runs at MCP layer; harness
  bypasses. Not a production issue.
- A2 P3: lcm_expand_query opaque "Delegated expansion query failed"
  when LLM unconfigured. Pre-existing; cycle-3 ergonomics.

PR Martian-Engineering#613 STATUS
==============
Branch tip: feat/lcm-v4.1-omnibus @ this commit
Tests: 1345/1345 (no regressions across 6 waves)
QA: 30/30 + 10/10
Audit: 6 waves, 47 unique findings closed (7 P0, 30 P1, 27+ P2, 16+ P3)

Ready for re-review and merge.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
… + 15 P1 closed

After Eva's correct push for full-PR re-audits (Waves 5-6 were focused
on diffs only and missed regressions in untouched surfaces), Wave-7 ran
22 parallel Opus 1M-context agents at ~1k LOC each across the full
~22K LOC production codebase. Surfaced 7 actionable P0s + ~30 P1s +
~25 P2s + ~15 P3s. (1 P0 from Auditor #17 was confused — was reading
a stale clone path; ignored.)

P0 — DATA / SECURITY / CORRECTNESS (7 closed)
=============================================

Auditor #14 P0-1 (CRITICAL — security): /lcm purge --apply lacked any
operator-session gate. The purge.ts module docstring explicitly
required "callers MUST gate via deps.isOperatorSession() or equivalent"
but the lcm-command.ts dispatch site at line 2626 wired runPurge with
ZERO check. Any agent that could issue /lcm slash commands could
purge another session's data — including Eva's primary thread via
--allow-main-session. Fix: gate the entire `case "purge":` dispatch
on `ctx.senderIsOwner` (the OpenClaw plugin SDK owner-only flag).
Both dry-run preview AND --apply require owner; preview is gated
because it leaks which leaves match the criteria.

Auditor #14 P0-2 (data loss): Purge cascade orphaned shared messages.
The UPDATE messages SET suppressed_at WHERE message_id IN (SELECT
... FROM summary_messages WHERE summary_id IN (...)) silently
suppressed messages even when they were referenced by NON-purged
leaves. assemble() filters on suppressed_at IS NULL → those
non-purged leaves lost their underlying message content invisibly.
Fix: added NOT EXISTS predicate that requires every other
referencing summary to ALSO be in the purge set OR already suppressed
before suppressing the message.

Auditor #6 P0 (cache pollution): sessionKeyForCache fell back to "" in
period mode when targetSummary was null AND input.sessionKey was
empty. The cache UNIQUE constraint then collapsed multiple users'
caches together — caller A's synthesis would surface in caller B's
loser-path SELECT. Fix: 4-tier fallback chain — targetSummary's key
→ input.sessionKey → conversationIds[0]'s session_key (looked up
from conversations table) → "agent:main:main" as last-resort default.

Auditor #9 P0-2: expandMessages did not honor the W4 budget=0
expansion-block; only expandChildren did. A delegated caller with
grant=0 calling expandMessages=true got full message content despite
the documented "expansion is blocked" assertion. Fix: identical
budgetExhausted gate added to the expandMessages branch.

Auditor #12 P0-A: Per-row SAVEPOINT MISSING in entity-coreference
batch tx. A single bad surface (FK violation, encoding issue, CHECK
failure) ROLLBACKed the WHOLE LEAF — discarding all valid mentions
already inserted AND failing to bump attempts (the dead-letter gate),
producing an infinite-retry loop on poison surfaces. Fix: each entity
surface now gets its own SAVEPOINT inside the batch tx. Per-row
failure rolls back JUST that surface; siblings + queue UPDATE survive.
Failures recorded in itemDetail.error per-index for operator visibility.

Auditor #9 P0-1: describe()'s "raw count" header LIED. It labeled
`s.childIds.length` as "raw candidate(s) before suppression filter"
but childIds was already suppression-filtered upstream by
getSummaryChildren default. Agents reading the header believed they
were seeing pre-filter counts. Fix: re-query the actual raw count via
a cheap COUNT(*) on summary_parents and emit honest "X of Y raw"
phrasing. When all children suppressed, distinguishes from "no
children" (terminal node) — was previously indistinguishable.

Auditor #19 P0: scripts/v41-synthesize-around-smoke.mjs still used
copyFileSync against the live WAL DB (W4 fixed v41-live-db-harness.mjs
+ preflight but missed this third script). Mid-checkpoint copies
produce malformed snapshots. Fix: VACUUM INTO atomic snapshot.

P1 — HIGH IMPACT (15 closed)
=============================

- Auditor #1 P1: searchLikeCjk used `new Date()` instead of
  parseUtcTimestamp → CJK fallback timestamps offset by host's local
  TZ. Other 4 search paths used parseUtcTimestamp; CJK was the outlier.
- Auditor #2 P1: Voyage responseBody privacy. W4 fixed only the 400
  path; 401/403/429/5xx/4xx-other still attached raw bodyText to the
  exception. Same Sentry/log-capture vector. Fix: route ALL non-200
  responseBody through summarizeBody for parity.
- Auditor #4/13 P1: tickExtraction ignored result.lockLostMidTick. W4
  added the field but the wrapper returned `lockAcquired: true`
  regardless. Now flips to false when heartbeat reported lock-loss
  mid-tick → autostart can detect + back off.
- Auditor #5 P1.1: best-of-N used Promise.all → one failed candidate
  threw away successful peers' work. Fix: Promise.allSettled. Throw
  only if ALL fail; judge picks among survivors.
- Auditor #5 P1.2: best-of-N with N=1 still ran judge — judge prompt
  expects 0..N-1 indexed candidates; many models emit 1-indexed and
  trip judge_failure. Fix: skip judge when only 1 candidate survived.
- Auditor #6 P1: parsePeriodShortcut regex over-accepted undocumented
  variants (last-3day, last-3-d). Fix: tightened to /^last-(\d+)d$|
  ^last-(\d+)-days$/ matching only documented forms.
- Auditor #8 P1-3: sort silent override. Agent passing sort=relevance
  with mode=regex got recency without warning. Fix: details now
  surfaces sortIgnored: true + requestedSort/effectiveSort.
- Auditor #8 P1-2: kFts/kSemantic over-fetch was max(limit, 50). At
  limit=200, rerank had ZERO headroom. Fix: 3× limit, floored at 50,
  capped at 500 (Voyage rerank budget).
- Auditor #21 + #8 P1-6: hybrid confidenceBand thresholds reuse
  cosine calibration on rerank scores (different scale). Fix: emit
  confidenceBandSource: "cosine" | "rerank" so callers know which
  signal drove the band.
- Auditor #12 P1-A: extractor placeholder pre-scan (W4 promised but
  never implemented). Fix: refuse extraction if leaf content contains
  XML envelope-like patterns (defense-in-depth against injection).
- Auditor #12 P1-E: dead-letter UPDATE failure left attempts at 0 →
  infinite retry. Fix: try second simpler bump-only UPDATE if the
  first (with last_error) fails.
- Auditor #18 P1: promptAwareEviction violates "structural-only"
  invariant. Fix: documented as opt-in with WARNING comment in
  config.ts that flagging it on breaks deterministic replay.
- Auditor #20 P1-3: README synthesize_around description was
  anchor-required-only — period mode (the lcm_recent replacement)
  not mentioned. Fix: 3-mode breakdown.
- Auditor #20 P1-4: THE_FIVE_QUESTIONS stale prose declared
  "themes/procedures/entities" all live. Themes + procedures were
  CUT (preserved in Martian-Engineering#616). Fix: explicit coverage status note.

VERIFICATION
============
- 1345/1345 unit tests passing (no regressions)
- QA runner full: 30/30 pass
- QA runner adversarial: 10/10 pass (not re-run; W6 baseline)
- Total cost ~$0.11 per full QA run

DEFERRED (acknowledged)
========================
- A14 P1: lcm_purge_audit table — needs schema migration; defer to
  cycle-3. Workaround: purge_session_id is returned + suppress_reason
  is recorded per leaf row.
- A18 P1: summarizeWithEscalation silent over-cap truncation —
  separate from W4 fallback marker fix; cycle-3 ergonomics.
- A8 P1-5: details.hits[] shape drift across 5 grep modes — by-design
  difference (regex/full_text are aggregates; hybrid/semantic/verbatim
  are per-row). Documented in agent-tools.md.
- A8 P1-4: verbatim recency-only ordering — by-design (citation use
  case prioritizes "what was said most recently").
- A10 P1-01: lcm_expand 24-day legacy timeout — sub-agent-only path,
  bounded by grant TTL.
- A10 P1-06: runExpand `?? 0` fallthrough — multi-conv grant path
  not exercised by lcm_expand_query (always single-conv).
- Various P2/P3 cosmetic items.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…0 + 9 P1 closed + 4 new regression tests

Wave-9 was the first audit cycle to give every agent FULL FILE context
(not just diffs) plus cross-cutting checklists tailored to their slice,
plus all prior wave findings as known-closed reference. Eva's directive:
"agents need ENOUGH CONTEXT not to introduce new issues while fixing
minor ones." Wave-9 also added a TS-strict closure pass (separate
commit 11f10a6) that brought PR-introduced TS errors from 30 → 0.

11 agents (slicing by responsibility, ~14.7k LOC src + 12.5k LOC tests
+ 2.2k LOC scripts):
  #1 Lossless core      — engine, assembler, retrieval, summarize, compaction
  #2 Migration + schema — db/migration, all migration tests
  #3 Storage layer      — summary-store, conversation-store
  #4 Search tools       — lcm_grep, lcm_semantic_recall, hybrid, semantic
  #5 Drilldown tools    — lcm_describe, lcm_expand, lcm_expand_query
  #6 Entity + extraction — lcm_get_entity, lcm_search_entities, coreference
  #7 Synthesis          — synthesize_around, dispatch, prompt-registry, seed
  #8 Voyage stack       — voyage/client, embeddings/store/backfill/semantic
  #9 Worker + concurrency — concurrency/*, autostarts, worker-orchestrator
  #10 Operator surface  — purge, health, reconcile, eval-runner, plugin
  #11 Scripts/QA-runner — coverage-gap audit Eva caught after launch

Findings: 1 P0 + 13 P1 + 22 P2 + 42 P3 = ~77 unique
(Agent #2 P2 and Agent #7 P1 converged on same `{{date_range}}` bug.)

This commit closes the P0 + 9 of 13 P1s + adds 4 regression tests.
Remaining P1s + all P2/P3 are documented in PR comment for follow-up.

P0 (CLOSED) — Owner gate parity (Agent #10):
- /lcm reconcile-session-keys --apply lacked senderIsOwner (Wave-7
  P0-1 had only added it to /lcm purge). Cross-session data theft
  vector: non-owner agent could re-key Eva's primary thread into an
  attacker bucket via --allow-main-session.
- /lcm worker tick embedding-backfill same gap (lower-impact:
  DoS-by-billing on the operator's Voyage account).
- Both fixed: same gate pattern as case "purge" applied to both.
- 3 new regression tests pin the gate behavior so future refactors
  can't silently regress.

P1 fixes (9 of 13):

P1.1 (Agent #5) — Citation-fabrication count threaded through
ExpandQueryReply. Wave-4+W6+W8 chain validated citedIds internally
(rejected fabricated IDs against summaries table) but
buildExpandQueryReply silently dropped the counts. Agent now sees
citedIdsRejectedAsFabricated + citedIdsExceededValidationCap in the
JSON reply (omitted when zero, summed across buckets in multi-conv
path).

P1.2 (Agent #5) — lcm_describe expandChildren/expandMessages now
consumes the grant token budget. Previously the budget was CHECKED
(budgetExhausted detection) but never DECREMENTED. With 50 children
+ 50 messages × ~2K tokens each = ~100K tokens delivered per call
without grant cap touching. Now sums consumed tokens and calls
authManager.consumeTokenBudget() for sub-agent sessions. Closes the
unbudgeted side-channel that defeated the W4/W6 expansion budget.

P1.3 (Agent #4) — lcm_grep --mode semantic VoyageError contract
parity. Previously caught only `auth` and SemanticSearchUnavailable;
let rate_limit/server_error/network/bad_request/unexpected propagate
as unhandled tool errors. lcm_semantic_recall correctly catches all
VoyageError kinds. Now mirrored — both surfaces routed for Question B
have identical error contract.

P1.4 (Agent #4) — lcm_grep --mode verbatim CJK fallback. messages_fts
uses tokenize='porter unicode61' which can't segment CJK ideographs
— MATCH on 中文 returned 0 rows WITHOUT throwing, so the
exception-driven LIKE fallback never fired. Now containsCjk(pattern)
detected at JS layer, routes directly to LIKE substring match
(skipping FTS join entirely). 1 new regression test covers Chinese
characters.

P1.5 (Agent #10) — reconcileSessionKeys TOCTOU race. affectedConvs
snapshot was taken OUTSIDE BEGIN IMMEDIATE; concurrent INSERT/UPDATE
between snapshot and tx-acquire could be UPDATE-moved without an
audit row, silently dropping it → loss-of-undo on a destructive op.
Same pattern as Wave-8 P1's runSoftPurgeAtomic fix. Refactored:
active-conflict pre-check + affectedConvs SELECT + UPDATEs all run
inside the same BEGIN IMMEDIATE.

P1.6 (Agent #10) — runRecallEval setTimeout leak. Promise.race
spawned a timer that was never cleared on adapter resolve. N=100
queries × 30s = 30s tail-latency floor + event-loop liveness held
open (process never exits in scripts). Added try/finally with
clearTimeout.

P1.8 (Agent #1) — Compaction fallback marker regression. Wave-4 P0
fix in summarize.ts tagged fallback content with "[LCM fallback
summary - model unavailable]" — but because the marker adds ~25
tokens, the resulting summary is LARGER than the source, so
summarizeWithEscalation rejected it as "didn't compress" and fell
through to compaction.ts's OWN buildDeterministicFallback which
emitted raw truncated content with NO marker, silently undoing the
W4 fix for any source <= max(targetTokens*4, 256) chars (i.e. most
leaves under LLM outage). Fix: prepend the same marker in
compaction.ts's fallback. Empty-source path tagged for parity.

P1.9 (Agent #2 + #7 convergence) — {{date_range}} placeholder
orphaned in seed prompts vs renderer. dispatch.renderPrompt only
substituted source_text/tier/memory_type. Seeded daily/weekly/
monthly templates used {{date_range}} literally; SynthesizeRequest
had no dateRange field. Currently latent (synthesize_around clamps
to custom/filtered) but becomes P0 the moment a daily/weekly/monthly
synthesis worker wires up. Same class as Final.review.3 Loop 4 Bug
4.2. Fix: dropped {{date_range}} from seeded templates (use
"from a single day/week/month" phrasing instead). Caller can bake
explicit ranges into sourceText if needed.

P1.10-P1.13 (Agent #11) — QA harness coverage gaps:

P1.10 — process.chdir("/tmp/lossless-claw-upstream") hardcoded made
the QA harness unrunnable anywhere except that exact path. Replaced
with a sentinel-file existence check that errors fast with a clear
"run from repo root" message.

P1.11 — adv-lcm-expand-query-smoke was vacuous: predicate returned
null unconditionally, args omitted required `prompt` field. Now
exercises full dispatch path with real prompt + asserts response
shape (answer + citedIds, or graceful LLM-unavailable error).

P1.12 — Period mode (lcm_recent replacement, most reviewer-debated
capability) had ZERO harness coverage. Added 2 new test cases:
period='yesterday' and period='last-7d' (covers the W7-tightened
hyphenated parser).

P1.13 — lcm_grep regex/full_text modes had ZERO harness coverage
(2 of 5 documented modes). Added 2 new test cases asserting the
regex/full_text response shape (totalMatches/messageCount/
summaryCount, not details.hits which is hybrid-only).

Verifications:
- npx tsc --noEmit → 739 errors (exactly matches origin/main baseline;
  ZERO PR-introduced TS errors)
- npx vitest run → 1353/1353 passing (1349 baseline + 3 owner-gate
  + 1 CJK regression tests)
- All Wave-9 fixes verified at code level on real file paths

Deferred P1s (4 of 13) — handled in follow-up commits / cycle-3:

- P1.7: TOCTOU between affectedConvs and active-conflict pre-check
  is now closed (folded into P1.5 fix above).
- Agent #5 P2 multi-bucket DEFAULT_MAX_CONVERSATION_BUCKETS=3 silent
  drop is documented but deferred (ergonomic, not safety).
- Agent #4 cosineSimilarity not clamped in hybrid mode: trivial 2-line
  fix but not safety.
- Agent #5 dead `runDelegatedExpansionLoop` in lcm_expand: cleanup
  task, no behavior change.

Pattern observation: Wave-9's full-file-context approach paid off —
caught the same class of bug (missing owner gate) on the SISTER case
of a previously-fixed P0, which a narrow-diff audit could not have
spotted. Future audits should keep this approach.
100yenadmin pushed a commit that referenced this pull request May 6, 2026
… 4 sub-agent test layers + 8 source bugs closed

A separate reviewer raised 12 findings on PR Martian-Engineering#613 with the strategic bar
"don't just make the findings disappear; make the PR truthful under real
operator scenarios." User correctly noted "wasn't sure if verified" so
I verified each before fixing. Verification result: 12-for-12 real bugs.

Combined with 4 parallel test-quality sub-agents addressing antipatterns
A8 (concurrency) + A9 (schema drift) + A1/A4 (adversarial scenarios +
fixture-test circularity) + A4-at-scale (stress fixture).

# Reviewer findings (all 12 closed)

## P1 (5)

- **#1 Period synthesis timezone** (src/tools/lcm-synthesize-around-tool.ts):
  parsePeriodShortcut anchored "today/yesterday/this-week/last-week/
  this-month/last-month" at UTC midnight. A Bangkok operator (UTC+7) at
  02:00 local asking "yesterday" got UTC-yesterday — ~17 hours off.
  Operator-trust violation. Now uses Intl.DateTimeFormat to compute
  local-day boundaries in lcm.timezone (configured IANA TZ); samples
  the offset at local noon to avoid DST-fold ambiguity. Relative forms
  (last-Nh, last-Nd) stay UTC-anchored (now-minus-N, not day-anchored).

- **#2 Synthesis cache key** (src/db/migration.ts +
  src/tools/lcm-synthesize-around-tool.ts): UNIQUE index keyed only on
  (session_key, range_start, range_end, leaf_fingerprint, grep_filter).
  Two correctness bugs: (a) tier='custom' then tier='filtered' for same
  range/leaves silently returned wrong-tier cached text, (b) registerPrompt
  changing the active prompt left cache serving stale text from the old
  prompt. Now includes tier_label + prompt_id in both the UNIQUE index
  and the lookup SELECT. Cache is rebuildable so wiping under the new key
  is safe.

- **#4 /lcm eval owner gate** (src/plugin/lcm-command.ts): /lcm eval
  mutates lcm_eval_run + lcm_eval_query_result tables AND can use Voyage
  in hybrid mode (small but non-zero quota cost). Wave-9 Agent #10 had
  classified it as READ_ONLY — the reviewer correctly challenged that
  classification. Now gated on senderIsOwner and added to the
  authorization-invariant test's DESTRUCTIVE_OPERATOR_CASES list.

- **#5 Voyage rerank token budget** (src/embeddings/hybrid-search.ts):
  rerank sent ALL candidates' full content with no enforcement of the
  ~600K-token cap. Realistic queries with many large condensed summaries
  hit Voyage 400 → silent RRF degradation, losing the +52.5pp paraphrastic
  recall lift. Now packs candidates into rerank input cumulatively until
  85% of MAX_TOKENS_PER_RERANK_CALL, dropping tail when over budget.
  Surfaces rerankPackTruncated + rerankPackedCount in HybridSearchResult.

- **#6 lcm_describe base content not charged**
  (src/tools/lcm-describe-tool.ts): Wave-9 P1.2 fix added
  consumeTokenBudget for expandedChildren + expandedMessages but skipped
  the base summary's s.content (which lines.push()es ALL of it). A
  sub-agent could lcm_describe a 30K-token condensed summary with NO
  expansion flags and drain context for free. Now charges base s.tokenCount
  too.

## P2 (5)

- **#3 Suppressed entity leakage** (src/tools/lcm-get-entity-tool.ts +
  src/tools/lcm-search-entities-tool.ts): when ALL mentions of an entity
  were suppressed via /lcm purge, the entity row in lcm_entities still
  leaked canonical_text + alternate_surfaces + metadata via both tools.
  The reviewer's framing: "suppression means invisible to agents, period."
  Both tools now require at least one unsuppressed mention via EXISTS
  guard. The "not found" branch now covers both "no such entity" AND
  "all mentions suppressed" indistinguishably (so an attacker can't
  infer entity existence). Updated test fixtures' insertEntity helpers
  to auto-create a default visible mention; tests that explicitly want
  the all-suppressed case opt out via noDefaultMention: true.

- **#7 Pending-extractions count** (src/extraction/entity-coreference.ts):
  countPendingExtractions filtered only on (kind, completed_at IS NULL),
  but runCoreferenceTick's selector ALSO requires (attempts < 5,
  summaries.suppressed_at IS NULL). Mismatch caused autostart to spin
  forever on rows the tick would never select. Predicate now exactly
  matches the selector.

- **#8 QA runner period coverage + exit semantics** (scripts/v41-qa-runner.mjs):
  period test cases I added in Wave-9 P1.12 omitted window_kind="period"
  (required by the tool), so they only hit schema-validation early-return
  and the regex match on 'period' made them trivially pass. Added the
  required field. Plus failedImportant had no exit branch — runner exited
  0 on any "important" failure, advisory-only. Added exit code 1 for
  important failures so the runner can act as a release gate.

- **#9 sqlite-vec install honesty** (package.json + semantic-infra-init.ts):
  sqlite-vec wasn't in any dependencies block, init log was log.info
  (low visibility), and PR_DESCRIPTION emphasized VOYAGE_API_KEY alone.
  Added to optionalDependencies; bumped log to log.warn with explicit
  install instructions + clear "what becomes unavailable" message.

- **#10 Backfill complete message lies** (src/plugin/lcm-command.ts):
  countBackfillPending excludes leaves with token_count >
  MAX_TOKENS_PER_EMBED_DOC, so an over-cap leaf was neither pending nor
  backfilled. Worker-tick output printed "✅ Backfill complete" even when
  over-cap leaves remained unembedded. Added countOverCapPendingForBackfill
  helper; completion message now distinguishes "in-range complete +
  over-cap remain" from full coverage.

## P3 (2)

- **#11 lcm_synthesize_around description** (src/tools/lcm-synthesize-around-tool.ts):
  agent-tool description still said "Two modes" (time + semantic) while
  schema declared three. Rewrote description + JSDoc to mention all three
  (period, time, semantic) and explicitly call out 'period' as the
  lcm_recent replacement / "what did we work on yesterday" surface.

- **#12 NUL byte in source** (src/tools/lcm-synthesize-around-tool.ts:331):
  fingerprintLeaves used a literal NUL byte (\x00) as a hashing separator,
  making the file binary to grep. Replaced with the escape sequence "\0"
  (functionally identical at runtime, readable in source). File is now
  searchable.

# Sub-agent test layers (4 in parallel)

## Sub-agent #1 — Concurrency / TOCTOU (test/v41-concurrency-invariants.test.ts, ~1044 LOC, 8 tests)

Worker-thread-based parallel-writer harness reproduces and pins
race-condition fixes: reconcileSessionKeys race (Wave-9 P1.5),
runSoftPurgeAtomic race (Wave-8 P1), worker-lock acquire (5-way),
heartbeat-during-LLM-call (Wave-9 Agent #8 P2), recordEmbedding
DELETE-before-INSERT atomicity. Verified regression-detection by
simulating pre-fix code. 0 new bugs found.

## Sub-agent #2 — Schema/placeholder drift (test/v41-schema-drift-invariants.test.ts, ~654 LOC, 19 tests)

Static-analysis tests via readFileSync + regex. Catches: placeholder
drift in seeded prompts vs renderer (Wave-9 P1.9 class), tier_label
CHECK constraint coverage vs TS union (Final.review.3 Bug 4.4 class),
manifest-vs-registered-tool drift (Wave-9 vapor-tools class),
parser/handler symmetry, FK ON-DELETE explicitness. **Found 3 P3 FK
drift bugs** — 3 declarations missing explicit ON DELETE clauses.
Closed in this commit (lcm_synthesis_cache.prompt_id,
lcm_synthesis_audit.prompt_id, lcm_embedding_meta.embedding_model →
all now `ON DELETE RESTRICT`).

## Sub-agent #3 — Adversarial scenarios + fixture-test circularity audit (test/v41-adversarial-scenarios.test.ts, ~1149 LOC, 37 tests)

Audit of original 25 scenarios: 16/26 strong, 9/26 weak ("only
totalMatches > 0"), 1 sentinel. Strengthened 6 weak tests in
v41-five-questions.test.ts (B1-B5, E2) to assert specific summary
IDs. **Found 1 real fixture bug**: summaries_fts insert used `rowid`
but schema declares `(summary_id UNINDEXED, content)` — original
B1-B5 tests "passed" only because they matched at the messages
layer, never actually exercising summary FTS. Fixed in fixture; the
strengthened B1-B5 tests now actually exercise summary FTS. 37 hard
adversarial scenarios spanning paraphrase, ambiguity/ranking, compound
queries, negative queries, content injection (placeholder/XML/script/
SQL-injection), ranking sensitivity, cross-tool composition,
suppression boundary.

## Sub-agent #4 — Stress fixture (test/fixtures/v41-stress-corpus.ts + test/v41-stress-fixture.test.ts, ~898 LOC, 11 tests)

Deterministic generator for 1500-2500 leaves with realistic distribution
(30% last-7-days, dense days with 100+ leaves, 5-10% suppressed, 5% CJK,
near-duplicates, 5 adversarial-content leaves). 11 stress tests cover
build smoke, determinism, distribution, dense-day query, suppression
cascade, FTS5 perf, vec0 KNN (graceful no-op when vec0 unavailable),
adversarial-content non-breaking, near-duplicate handling, recency floor.

# Wave-10 reviewer regression coverage (test/v41-wave10-reviewer-regressions.test.ts, 6 tests)

Pins fixes for #2 (cache UNIQUE index w/ tier+prompt), #3 (suppressed
entity invisibility), #7 (pending count predicate), #10 (over-cap
counting). #1 has its own dedicated v41-period-timezone.test.ts (8
tests). #4 covered by extending v41-authorization-invariants.test.ts
DESTRUCTIVE_OPERATOR_CASES.

# Verification

- **1490/1490 tests passing** (1401 pre-Wave-10 + 89 new from this commit)
- **677 TS errors** (FEWER than the 739 main baseline — type-tightening
  fixes cascaded from the source changes)
- 4 sub-agent test files all green
- 6 reviewer-regression tests all green
- Authorization invariant test now covers `eval` → catches future
  removal of the gate

# What's NOT in this commit (future work)

- Mutation testing CI integration (stryker is too slow for per-PR;
  config exists for ad-hoc invocation)
- Wave-1-9 antipattern tabulation update with Wave-10 findings
100yenadmin pushed a commit that referenced this pull request May 6, 2026
…ed 12/12 real)

Fresh re-audit at 37e2b71 found 12 issues; 11 closed in this commit, 1
documented as known limitation. Reviewer was 12-for-12 real (Wave-10
was also 12-for-12; reviewer track record: 24-for-24).

# CI blockers

- **#1 (P1)** Auth invariant test hardcoded `/tmp/lossless-claw-upstream`
  path. CI failed because that path doesn't exist on GitHub runners;
  local runs accidentally succeeded by reading whatever stale checkout
  was at that path. Now resolves via `import.meta.url` →
  `__dirname/../src/plugin/lcm-command.ts`. Works in any worktree.

- **#10 (P2)** `pnpm-lock.yaml` was stale after the Wave-10
  `optionalDependencies` addition. Regenerated via `pnpm install
  --lockfile-only`; verified `pnpm install --frozen-lockfile` succeeds.

# Security parity

- **#2 (P1)** `/lcm doctor apply` and `/lcm doctor clean apply` lacked
  `senderIsOwner` gate. Wave-9 Agent #10 had classified the doctor
  cases as READ_ONLY, but the `apply` flag inside dispatches to the
  summarizer (cost) AND mutates summaries (state) for `doctor apply`,
  and DELETEs cleaner matches for `doctor clean apply`. Mirror the
  purge / reconcile / worker-tick / eval gate pattern. Read-only
  variants (no `--apply`) stay open.

  Plus updated `test/lcm-command.test.ts`'s `createCommandContext`
  helper to default `senderIsOwner: true` so existing tests for the
  doctor mutating paths continue passing — Wave-9 negative tests
  still explicitly pass `senderIsOwner: false` via overrides.

  Plus added 4 new tests to `v41-authorization-invariants.test.ts`
  pinning the Wave-11 doctor-apply gate behavior (apply-rejected,
  read-only-allowed for both `doctor` and `doctor clean`).

- **#5 (P1)** `lcm_describe` early-budget-gate. The Wave-10 fix charged
  base summary tokens against the grant AFTER emitting `s.content`.
  For a sub-agent at zero remaining budget, the content was already
  disclosed before accounting could prevent it. Added an EARLY gate:
  if delegated session AND base summary tokens > remaining grant,
  redact `s.content` with a clear "[REDACTED — base summary content
  is N tokens but grant has only M remaining]" message and skip the
  charge. Closes the disclosure-before-accounting path.

# Correctness

- **#3 (P1)** Timezone fractional offsets + DST. Wave-10's "sample
  offset at noon" approach broke on:
    - Half-hour zones: Asia/Kolkata (UTC+5:30) → showed +5 not +5:30
    - Quarter-hour zones: Asia/Kathmandu (UTC+5:45)
    - DST transition days: LA spring-forward 2026-03-08 → noon is in
      PDT (-7) but local midnight was in PST (-8); my function used
      the noon offset for the whole day → wrong by 1 hour
  Replaced with iterative converge-to-midnight algorithm:
    1. Format `at` in target tz to get y/m/d
    2. Probe = naive `Date.UTC(y, m-1, d, 0, 0, 0)`
    3. Format probe in target tz; compute delta from target midnight
    4. Adjust probe; repeat until delta=0 (typically 1-2 iters)
  Handles all IANA timezones, DST transitions, and arbitrary offsets.

  Added 3 new regression tests:
    - Asia/Kolkata 'yesterday' (UTC+5:30) — half-hour offset
    - Asia/Kathmandu 'today' (UTC+5:45) — quarter-hour offset
    - America/Los_Angeles 2026-03-08 — spring-forward day, asserting
      'today' duration is exactly 23h

- **#6 (P1)** Hybrid rerank now skips individually oversized
  candidates instead of bailing. Pre-fix: when the FIRST candidate
  exceeded the 510K-token (85% of 600K) rerank budget, the packer
  set `rerankPacked=[]` and broke out, disabling rerank for the
  whole result set. Now: oversized candidates are individually
  skipped (counted in `rerankPackSkippedOversized`) and packing
  continues with later candidates that fit. Result: a single huge
  FTS hit no longer takes down the whole rerank.

- **#7 (P1)** Voyage `output_dimension` not forwarded. Configurable
  embedding dimensions (`LCM_EMBEDDING_DIM=2048` registers a 2048-dim
  profile in `lcm_embedding_profile`) but `embedTexts()` never sent
  `output_dimension` to Voyage, so Voyage returned its default (1024).
  vec0 INSERT then failed with dim mismatch on the per-model table.
  Added `outputDimension?: number` to `VoyageEmbedOptions`; forwarded
  via backfill (`opts.voyageOutputDimension`) and semantic-search
  query embed (`active.dim`). Default unchanged (omit → Voyage 1024).

# Documentation accuracy

- **#4 (P1)** Synthesis dispatch model claim. Tool description said
  "per-tier dispatch (haiku/sonnet/opus/thinking)" but actual LLM call
  routes through the configured summarizer chain (which ignores
  `args.model`). Source code already had honest comment in
  `buildLlmCallFromSummarizer` ("the summarizer wrapper ignores the
  dispatch-supplied model"); the tool description and PR description
  overclaimed. Updated tool description to be accurate: dispatch
  records the per-tier model name in the audit table, but the
  actual LLM call uses the operator's configured summarizer chain.

# Polish

- **#9 (P2)** Health archive filter. `readActiveProfile` selected on
  `active = 1` alone, ignoring `archive_after IS NOT NULL`. Semantic
  retrieval correctly filters archived; health was reporting a
  profile semantic search would not actually use during model cutover.
  Now matches: `WHERE active = 1 AND archive_after IS NULL`.

- **#11 (P2)** Changeset rewritten. Old changeset only mentioned
  session-family recall. New changeset documents the full v4.1
  release surface: 8 agent tools (with new modes), 2 worker autostarts,
  9 operator commands (with owner-gating), schema changes, sqlite-vec
  optionalDependency, configuration env vars, and what was cut to Martian-Engineering#616.

- **#12 (P3)** Stale entity-search docblock. The header comment said
  "entities with all-suppressed mentions can still appear here";
  Wave-10 added the EXISTS guard so they no longer can. Updated
  comment to reflect the actual filter behavior.

# Known limitation (deferred)

- **#8 (P2)** Cache key still ignores resolved model. Adding `model_used`
  to the UNIQUE index doesn't help because model resolution is dynamic
  (the summarizer chain picks at call time, not before INSERT). The
  proper fix is invalidate-on-mismatch at cache-hit time, which is a
  larger refactor. Documented in the entry above + tracked for follow-up.

# Verification

- `npx vitest run`: **1513 / 1513 tests passing** (1502 → 1513;
  +11 new regression tests for Wave-11 fixes)
- `npx tsc --noEmit`: **677 errors** (still below 739 main baseline;
  no PR-introduced TS errors)
- `pnpm install --frozen-lockfile --ignore-scripts --lockfile-only`:
  **succeeds** (was failing pre-fix with ERR_PNPM_OUTDATED_LOCKFILE)
- Authorization invariant test: now resolves the source path relative
  to test file via `__dirname` — works in any checkout location
100yenadmin pushed a commit that referenced this pull request May 7, 2026
…pattern

Wire #2 of 3 for the agent context-management architecture (Wave-14).

# What this lands

Tools that could push context over budget now run a pre-call gate
BEFORE doing work: estimate the result size; if (currentTokens +
estimated) / tokenBudget > REFUSAL_THRESHOLD (0.92), return a
structured `{ok: false, needsCompact: true, ...}` payload instead.
Agent reads, calls lcm_compact, retries — the natural negotiation
pattern.

Without this layer, an agent at 78% context calling
`lcm_describe expandMessages=true expandMessagesLimit=20` (estimated
13K tokens) lands at ~84% AT BEST — but worst-case messages can
saturate the result-cap and push past 100%, causing
context_length_exceeded errors mid-turn.

# Tools wired

PRE-CHECK ENFORCED (7):
- lcm_grep (5 modes)
- lcm_semantic_recall
- lcm_describe (HIGHEST priority — biggest blow-up risk per Agent C)
- lcm_expand_query
- lcm_get_entity
- lcm_search_entities
- lcm_compact (small footprint; included for uniform agent UX)

NOT WIRED (intentionally — self-protecting or out-of-scope):
- lcm_synthesize_around: internal 50K source cap; prompt-bounded
  output ~2-3K. Per Agent B, can't blow context.
- lcm_expand: sub-agent-only, has its own grant ledger

# Files

NEW:
- `src/plugin/needs-compact-gate.ts` (~190 LOC) — REFUSAL_THRESHOLD
  constant (0.92 — calibrated against real DB), per-tool
  `estimateResultTokens(toolName, params)` formulas, the
  `evaluateNeedsCompactGate` core logic, and a `runWithTokenGate`
  wrapper helper that tools use to compose pre-check + post-call
  cache accumulation.
- `test/v41-needs-compact-gate.test.ts` (~120 LOC) — 19 tests covering
  per-tool estimator math, refusal logic, suggested-action narrowing,
  bypass-on-missing-telemetry, and threshold boundary cases.

EDITED (each ~5-10 LOC of changes):
- src/tools/lcm-grep-tool.ts — gate at top of execute, tap on returns
- src/tools/lcm-describe-tool.ts — gate + tap on final return
- src/tools/lcm-semantic-recall-tool.ts — runWithTokenGate wrapper
- src/tools/lcm-expand-query-tool.ts — wrapper
- src/tools/lcm-get-entity-tool.ts — wrapper
- src/tools/lcm-search-entities-tool.ts — wrapper
- src/plugin/index.ts — pass `getRuntimeContext` to all 7 tool factories
- src/plugin/token-state.ts — add `tapResultForTokenAccounting` helper

# How the agent experience works

```
Agent: lcm_describe id=sum_xxx expandMessages=true expandMessagesLimit=30

Tool gate:
  estimatedResultTokens = 10000 (capped)
  currentRatio = 0.78
  projectedRatio = (156000 + 10000) / 200000 = 0.83 → BELOW 0.92 → run normally

Agent: lcm_describe id=sum_yyy expandMessages=true expandMessagesLimit=30

Tool gate:
  currentRatio = 0.89  // accumulated from previous result
  projectedRatio = 0.94 → OVER 0.92 → REFUSE

Tool returns:
{
  ok: false,
  needsCompact: true,
  reason: "context-overflow-prevention",
  currentRatio: 0.89,
  estimatedResultTokens: 10000,
  projectedRatio: 0.94,
  note: "Serving this call would push context to 94% of budget...",
  suggested_actions: [
    "lcm_compact then retry with same params",
    "retry with expandMessagesLimit=15"
  ]
}

Agent: reads, calls lcm_compact, retries. Now at 70% — call succeeds.
```

# Threshold (0.92) calibration

Wave-14 Agent A sampled Eva's live DB (3,904 leaves, 414 condensed,
315K messages). Per-tool result hard cap is 10K tokens
(MAX_RESULT_CHARS / 4). With 200K context:
  0.95 cushion → 10K headroom = zero margin (one capped call → 100%)
  0.92 cushion → 16K headroom = one capped call + agent response
  Lower thresholds → over-refusal on safe calls

# Per-tool estimator confidence

(Per Wave-14 Agent C calibration against actual format strings)
- lcm_grep regex/full_text/hybrid/semantic — 90%
- lcm_grep verbatim — 60% (variable per-message size)
- lcm_semantic_recall — 90%
- lcm_describe (no expand) — 70%
- lcm_describe (expand flags) — 60% (high subtree variance)
- lcm_get_entity / lcm_search_entities — 90%
- lcm_expand_query — 80%

Estimator capped at HARD_CAP_TOKENS (10K) regardless of natural
estimate — protects against under-estimation. Tools that return less
than estimated just have headroom; tools with bad estimates get
their natural cap protection.

# Verification

- 1592/1592 tests passing (1573 baseline + 19 new gate tests)
- 7/7 release-readiness preflight checks pass
- 330 TS errors (under 700 baseline; PR introduced none)

# What's next (Commit 3 of 3)

Synchronous compaction at critical pressure (`afterTurn` deferred-mode
drain runs sync at >0.85 currentRatio). System-level safety net
behind the agent-driven layers.
100yenadmin pushed a commit that referenced this pull request May 7, 2026
Wave-2 cross-cutting audit (4 parallel agents: token-state-integration,
schema/suppression, test/manifest/harness, fresh-eyes) caught 2 P0s + 1
P1 the per-file Wave-1 sweep missed.

P0 — token-state cache + accounting bus
- Post-compact stale cache: noteSuccessfulCompact() clears the entry
  on successful lcm_compact so the very next wrapped call re-bootstraps
  from the post-compact ground truth instead of refusing on the stale
  pre-compact snapshot. Without this, the agent could loop compact→
  refuse→compact until the 2/5min cap blocks further attempts.
- lcm_synthesize_around was OFF the runWithTokenGate accounting bus —
  the prior "self-protecting via 50K source cap" comment covered SOURCE
  input bounds, not OUTPUT (4K-8K markdown rollup flowed past the cache
  silently and drifted gate decisions low). Wrapped it; wired
  getRuntimeContext through registration in src/plugin/index.ts.

P1 — runWithTokenGate error path
- Tool throws (e.g. "LCM engine is unavailable" — present in 6+ tools
  + 13 throw sites in lcm_expand_query) skipped tapResultForTokenAccounting
  entirely. The runtime-serialized error message DOES cost tokens, so
  the cache drifted low by exactly the size of the error message every
  time. Added try/catch tap-then-rethrow.

Manifest drift fix
- registerTool comment placement: moved the W2A1 P0 #2 comment from
  between `=>` and `createLcmSynthesizeAroundTool` (where the manifest
  test's regex /=>\s*\{?\s*(?:return\s+)?(create...)/ couldn't match)
  to ABOVE the api.registerTool block. Re-runs 8/8 against the manifest.

Cosmetic
- README tool inventory: removed lcm_semantic_recall line, added
  lcm_compact + Wave-12 SA consolidation note (was: 9 listed minus 1
  removed but +1 missing = count cancels out, hidden bug).
- THE_FIVE_QUESTIONS.md: coverage 22/25 → 27/30 (post F1-F5 addition).
- 7 stale lcm_semantic_recall comment refs in src/embeddings/semantic-search.ts,
  src/engine.ts, src/store/summary-store.ts, src/tools/lcm-synthesize-around-tool.ts,
  test/v41-stress-fixture.test.ts, test/v41-tool-budget-guardrail.test.ts.

Verified
- 1587/1587 vitest passing (Wave-2 batch added regressions for the new
  noteSuccessfulCompact + try/catch tap behaviors).
- 35/35 QA harness against live-DB snapshot at \$0.11; F1/F4 args swap
  fix confirmed (F1 catalog browse, F4 PR filter).
100yenadmin pushed a commit that referenced this pull request May 7, 2026
…describe cap

W1A1 #2 — estimator HARD_CAP was hard-coded at 10_000 but the per-tool
char cap (LCM_TOOL_RESULT_TOKEN_BUDGET) is operator-tunable. With env
raised to 30K, tools could emit 30K but the gate's projection still
capped at 10K — needsCompact decisions drifted low (refusals missed
when they should fire) by up to 3×.

W1A8 #3 — lcm_describe was truly unbounded. Worst case (Wave-12
estimator already noted this in a code comment): a single
describe(condensed_id, expandChildren=true) on a wide condensed could
emit ~210K tokens (10K base + 20×10K children). Sub-agent grant ledger
(consumeTokenBudget, Wave-9 P1) protected delegated sessions; main-
agent calls had no per-tool char cap.

Single source of truth
- New src/plugin/result-budget.ts owns the env knob resolution. Exports:
  - MAX_RESULT_TOKENS — used by needs-compact-gate as HARD_CAP_TOKENS
  - MAX_RESULT_CHARS — used by tools for truncation
  - truncationNotice(reasonHint) — standard message format
- needs-compact-gate.ts pulls HARD_CAP from MAX_RESULT_TOKENS so the
  estimator and per-tool cap stay in lockstep.
- lcm-grep-tool.ts drops its local resolveMaxResultChars (now imports
  from result-budget). Behavior identical at the default; no change to
  truncation messages. (Existing per-grep messages preserved.)

lcm_describe truncation
- truncateLinesToCap helper at top of file. Mirrors lcm_grep's pattern:
  walk lines, accumulate char count (incl. join newlines), append the
  truncation notice and stop when over cap.
- Applied at both return sites (summary describe + file describe).
- details.manifest.truncated boolean flag exposed for programmatic
  callers; details.truncated on the file branch.

Tests (6 new, total 15 in suite)
- env=30000 → MAX_RESULT_TOKENS=30K, MAX_RESULT_CHARS=120K, estimator
  projection rises above 10_000 for verbatim mode (proves no longer
  pinned at the old hard-coded ceiling)
- env unset → 10_000 default
- env=100 → clamped UP to 2_000 floor (anti-misconfig)
- env=garbage → falls back to 10_000 default
- describe with 30K-char content + env=2000 → bounded under 10K + emits
  truncation marker
- describe with small content → emits full content, no truncation marker

Verified
- 1593/1593 vitest passing (was 1587, added 6 regression tests)
100yenadmin pushed a commit that referenced this pull request May 7, 2026
Wave-12 found 9 of 10 bugs that escaped 1593 tests. Each bug was
hidden by a distinct antipattern. This commit adds 4 new test layers
that pin the antipatterns so each bug class fails LOUDLY on regression.

A. Wiring/registration smoke (14 tests)
- test/v41-tool-wiring-smoke.test.ts
- For each tool documented as wrapped in needs-compact-gate.ts: assert
  the factory file calls runWithTokenGate(. For each documented-exempt
  tool: assert it does NOT call runWithTokenGate(. Catches the W2A1 P0
  bug class (synthesize_around silently dropped off the bus).
- For each registered tool in plugin/index.ts: assert getRuntimeContext
  is wired. Catches the half of the bug where the wrapper is present
  but not given runtime context.

B. Adversarial output bounds (3 tests)
- test/v41-adversarial-output-bounds.test.ts
- lcm_get_entity with 200 mentions × 1000-char surface_forms: bound check
- lcm_search_entities with 500 entities × 200-char canonical: bound check
- lcm_search_entities respects schema-bounded limit even with caller=500
- Catches W1A8 #3 sister cases (any tool that emits content without
  per-tool char cap).

C. Cross-module invariants (6 tests)
- test/v41-cross-module-invariants.test.ts
- estimateResultTokens projection ceiling === MAX_RESULT_TOKENS
  (caller-tunable env knob). Catches the W1A1 #2 bug class where two
  modules pin the same constant in isolation and drift apart.
- MAX_RESULT_CHARS = MAX_RESULT_TOKENS × 4 ratio
- REFUSAL_THRESHOLD calibration sanity vs MAX_RESULT_TOKENS
- Every src/tools/lcm-*-tool.ts factory referenced in plugin/index.ts
- summaryKinds reaches BOTH semantic and hybrid dispatch (W1A5 #1
  schema-vs-implementation drift)
- Sub-agent expansion-auth gate consistency (lcm_expand + lcm_describe
  both consult same manager)

D. QA-runner antipattern static scan (26 tests)
- test/v41-qa-runner-antipatterns.test.ts
- Extracts each `expect: (r) => {...}` closure from qa-runner.mjs.
  For tools with external deps (Voyage / LLM), assert the graceful-
  degradation regex check appears BEFORE bare `if (r.error) return`.
  Catches the W1 F5 bug class (inverted predicate making graceful
  branch dead code).
- Pins F1 has no entityType filter (catalog browse) AND F4 has
  entityType: pr_number (W1 F1/F4 args swap regression).

Verified
- 1642/1642 vitest passing (was 1593, +49 new tests; 0 bugs surfaced
  by the new layers — the patterns pin the existing post-Wave-12 fixes
  rather than uncovering new issues).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cache-aware compaction: timing inversion — decisions based on last-call status instead of cache expiry