Skip to content

feat(acp): emit agent_error session/update when LLM call fails after retries#26352

Open
truenorth-lj wants to merge 3 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error
Open

feat(acp): emit agent_error session/update when LLM call fails after retries#26352
truenorth-lj wants to merge 3 commits into
anomalyco:devfrom
truenorth-lj:feat/halt-emits-agent-error

Conversation

@truenorth-lj
Copy link
Copy Markdown

@truenorth-lj truenorth-lj commented May 8, 2026

Issue for this PR

Closes #24494
Closes #28453

Stacks on #26306, which adds the agent_error SessionUpdate kind and the LLMErrorPayload shape this PR uses. Diffs from that PR are included here until it merges.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement / new feature (non-breaking change which adds functionality)

What does this PR do?

When the retry policy in session/retry.ts is exhausted, session/processor.ts halt() records the failure internally but does not emit an ACP frame to the connected client. From an ACP client perspective, the turn can look silently stuck: no final error notification and no stopReason.

This PR adds a case "session.error": branch in packages/opencode/src/acp/agent.ts that translates the SDK error variant into an LLMErrorPayload and emits the new agent_error session/update kind with stopReason: "error".

The payload extraction prefers classification headers when present:

  • x-llm-error-type: budget | rate_limit | provider_unavailable | context_overflow | auth | unknown
  • x-llm-error-retryable: "true" / "false" (overrides type-derived retryable)
  • x-llm-error-reset-at: epoch ms - populates reset_at_epoch_ms
  • retry-after: seconds - populates retry_after_seconds

It falls back to status-code heuristics when headers are absent (401 -> auth, 5xx -> provider_unavailable, else -> unknown). ProviderAuthError and ContextOverflowError SDK variants are mapped by name.

Two session.error variants are intentionally not emitted as agent_error:

  • ContextOverflowError: compaction handles this path and the turn can continue on a smaller context window.
  • MessageAbortedError: this is the normal user-stop path triggered by session/cancel; the prompt RPC reports cancellation via stopReason: "cancelled", so emitting agent_error would make one user action look like both an error and a cancellation.

The existing event subscription already receives session.error events through the SDK event stream. No bus, SDK, or event-stream wiring changes are required.

How did you verify your code works?

packages/opencode/test/acp/halt-emits-agent-error.test.ts covers:

  • APIError classification headers -> typed payload fields preserved
  • fallback status-code classification for 503 and 401
  • ContextOverflowError and ProviderAuthError mapping
  • explicit retryability override from headers
  • session.error emits exactly one agent_error update for a classified API error
  • ContextOverflowError emits no agent_error
  • MessageAbortedError emits no agent_error
  • unknown session IDs emit no update
$ bun test test/acp/halt-emits-agent-error.test.ts
 11 pass
 0 fail
 24 expect() calls

$ bun run typecheck
# clean

# push hook
$ bun turbo typecheck
 Tasks:    12 successful, 12 total

Screenshots / recordings

N/A - backend / ACP wire change. Client rendering of the agent_error frame is downstream consumer work.

Checklist

  • All tests pass locally
  • Code follows the existing event-handling style
  • Diff is confined to ACP error mapping / event handling and tests
  • No new dependencies, schema migrations, or breaking protocol changes
  • bun run typecheck clean

truenorth-lj and others added 2 commits May 8, 2026 15:53
…or propagation

Phase 1B of the LLM error propagation refactor. This is the TS mirror
of the Python contract merged in tn-mono PR anomalyco#721 (commit 884bcf71).

Why a new file instead of an ambient .d.ts augmentation: the SDK's
SessionUpdate is a closed `type` alias (discriminated union), not an
`interface`. TypeScript declaration merging only works on interfaces,
so we cannot extend SessionUpdate via a `.d.ts` patch. The clean
TS-native approach is a local extended type that consumers opt into.

Adds:
- LLMErrorType: closed string union of 6 categories (budget, rate_limit,
  provider_unavailable, context_overflow, auth, unknown)
- LLMErrorPayload: snake_case wire shape mirroring Python LLMErrorPayload;
  retryable is on-the-wire explicit even though derivable from type
- AgentErrorUpdate: { sessionUpdate: "agent_error"; error; stopReason? }
- SessionUpdateWithAgentError: SessionUpdate | AgentErrorUpdate (Phase 4
  emit site at session/processor.ts halt() will use this)
- isRetriable(type): TS mirror of the Python is_retriable() classifier
- isAgentErrorUpdate(value): type guard for narrowing unknown frames

12 unit tests cover: classifier per type, vocabulary stability, type-guard
positive/negative paths, JSON round-trip, compile-time discriminated
union narrowing.

Spec: specs/20260508-llm-error-propagation/spec.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…retries

When the retry policy in session/retry.ts is exhausted, halt() updates
internal state (ctx.assistantMessage.error, Bus.Session.Event.Error,
EventV2.SessionEvent.Step.Failed.Sync) but never emits any ACP frame
to the connected client. From an ACP client's perspective the turn is
silently stuck — no stopReason, no error notification, no final
session/update — and the only signal is whatever timeout the client
imposes locally.

Add a session.error case in acp/agent.ts handleEvent() that translates
the SDK error variant into an LLMErrorPayload and emits the
agent_error session/update kind with stopReason: "error". The payload
prefers headers set by an upstream classifying proxy
(x-llm-error-type / x-llm-error-retryable / x-llm-error-reset-at /
retry-after) over status-code heuristics.

ContextOverflowError is intentionally NOT emitted — halt() routes that
variant into in-process compaction; the turn continues on a smaller
context window rather than ending in an error state.

Tests cover the SDK→LLMErrorPayload mapping for every error variant
(APIError with classification headers, APIError status-code fallback,
ContextOverflowError, ProviderAuthError, explicit retryable header
override) and the integration path that pushes a session.error event
through the agent's event subscription and asserts the resulting
agent_error session/update.

Builds on anomalyco#26306 (which adds the agent_error SessionUpdate kind +
LLMErrorPayload type definitions). Closes anomalyco#26350.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

The following comment was made by an LLM, it may be inaccurate:

Based on my search, PR #26306 is a related PR but not a duplicate:

The other PRs found (26292, 26192, 19116, 24369, 18443) address different aspects of retry/error handling but are not duplicates of the current PR.

Conclusion: No duplicate PRs found. The only related PR is #26306 (a required dependency that this PR builds upon).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant