Sessions permanently stuck after server restart or stream interruption — no startup recovery for orphaned messages/tool parts #19023

@dzianisv

Description

When the OpenCode server restarts (or the process crashes) while a session is actively executing tool calls, the session gets permanently stuck in a "Thinking" state. The root cause is that there is no startup recovery that cleans up orphaned assistant messages and tool parts.

What happens

  1. A session is actively executing tool calls (e.g., bash commands)
  2. The server restarts or crashes
  3. The in-memory session state (SessionStatus) is lost — the session is no longer "busy"
  4. But the database state is stale: the last assistant message has time.completed = undefined (never completed) and tool parts remain in status: "running" forever
  5. The UI sees the incomplete assistant message and shows a permanent "Thinking" spinner
  6. The session cannot recover — sending a new message creates a new loop iteration, but the old orphaned message still exists
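
The mismatch between steps 3 and 4 can be made concrete with a small sketch. The types and the `isOrphaned` helper are hypothetical, simplified shapes for illustration, not OpenCode's actual schema; only `time.completed` and the `pending`/`running` statuses come from the description above.

```typescript
// Hypothetical, simplified shapes; the real OpenCode types carry more fields.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
}

interface AssistantMessage {
  id: string;
  sessionID: string;
  role: "assistant";
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// Stand-in for the in-memory SessionStatus map; it is empty after a restart.
type BusySessions = Set<string>;

// A message is orphaned when the database says it never completed
// but no live process is working on its session.
function isOrphaned(msg: AssistantMessage, busy: BusySessions): boolean {
  return msg.time.completed === undefined && !busy.has(msg.sessionID);
}

const stale: AssistantMessage = {
  id: "msg_1",
  sessionID: "ses_example",
  role: "assistant",
  time: { created: 1 },
  parts: [{ id: "prt_1", status: "running" }],
};

const beforeCrash: BusySessions = new Set(["ses_example"]);
const afterRestart: BusySessions = new Set(); // in-memory state was lost

console.log(isOrphaned(stale, beforeCrash)); // false
console.log(isOrphaned(stale, afterRestart)); // true
```

The same database row is healthy before the crash and orphaned after it; only the lost in-memory state differs.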

Root cause analysis

The existing cleanup in processor.ts:402-417 correctly handles the normal case — when the stream ends (normally, via error, or abort), it force-sets any non-terminal tool parts to status: "error". However, this cleanup only runs if the process survives long enough to reach it.
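
Why in-process cleanup cannot cover a crash can be shown with a minimal sketch. It assumes the cleanup sits in a `finally`-style position after the stream loop; `runWithCleanup`, the part shape, and the error string are hypothetical placeholders, not the contents of processor.ts.

```typescript
// Hypothetical shape; only the status values come from the issue text.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
  error?: string;
}

// Runs the stream body, then force-sets any non-terminal tool part to "error".
// The finally block covers a normal return and thrown errors (including aborts
// surfaced as exceptions). It never runs if the process is killed, because no
// JavaScript executes after that point.
function runWithCleanup(parts: ToolPart[], body: () => void): void {
  try {
    body();
  } finally {
    for (const part of parts) {
      if (part.status === "pending" || part.status === "running") {
        part.status = "error";
        part.error = "Tool execution aborted"; // placeholder message
      }
    }
  }
}

const parts: ToolPart[] = [
  { id: "prt_1", status: "completed" },
  { id: "prt_2", status: "running" },
];

try {
  runWithCleanup(parts, () => {
    throw new Error("stream failed");
  });
} catch {
  // The stream error still propagates; cleanup has already run.
}

console.log(parts.map((p) => p.status).join(",")); // completed,error
```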

There is zero recovery at startup:

  • Session.initialize() does not scan for orphaned messages
  • SessionStatus (in-memory map) is empty after restart — no stale detection
  • No background watchdog checks for sessions stuck in busy state

The only defense is in toModelMessages() (message-v2.ts:740-746), which converts pending/running tool parts into "[Tool execution was interrupted]" when building the next LLM prompt. This lets the model recover context if the user sends a new message, but it does not change the stored state: the UI still shows the session as stuck because the orphaned assistant message has no time.completed.
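
A rough sketch of why the prompt-side defense does not unstick the UI. `toolResultForPrompt` and `showsThinking` are hypothetical helpers, and the UI rule is an assumption inferred from the behavior described above, not taken from the frontend code.

```typescript
// Hypothetical, simplified shapes for illustration.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  status: ToolStatus;
  output?: string;
}

interface AssistantMessage {
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// Prompt-side defense: substitute a placeholder for interrupted tool calls.
// This is a pure function; the stored part is not modified.
function toolResultForPrompt(part: ToolPart): string {
  if (part.status === "pending" || part.status === "running") {
    return "[Tool execution was interrupted]";
  }
  return part.output ?? "";
}

// Assumed UI rule: an assistant message without time.completed renders "Thinking".
function showsThinking(msg: AssistantMessage): boolean {
  return msg.time.completed === undefined;
}

const orphan: AssistantMessage = {
  time: { created: 1 },
  parts: [{ status: "running" }],
};

const promptText = orphan.parts.map(toolResultForPrompt);

console.log(promptText.join(", ")); // [Tool execution was interrupted]
console.log(orphan.parts.map((p) => p.status).join(",")); // running
console.log(showsThinking(orphan)); // true
```

The prompt sees a clean placeholder, while the persisted part stays in "running" and the message stays incomplete.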

Observed in production

  • Session ses_2f4299f5cffeVZfxCt3ViZ7eVJ stuck for 3+ hours with a git log tool part permanently in "running" status
  • Session ses_2e9127723ffeKJ1JpjLNS35B4z showed a similar pattern (this one turned out to be still running a long k8s test, but it demonstrates the same vulnerability)

Relation to existing issues

This is the backend root cause behind several reported symptoms.

Open PRs #16907 and #17593 address frontend symptoms (making the UI more defensive about stale state), but neither fixes the backend root cause — orphaned messages and tool parts in the database.

Proposed fix

Startup recovery in Session or app bootstrap:

  1. On server start, query all messages where time.completed IS NULL and the message role = "assistant"
  2. For each orphaned message:
    • Set time.completed = Date.now()
    • Set all tool parts with status = "running" or status = "pending" to status = "error" with error = "Tool execution was interrupted (server restart)"
    • Emit Bus events so connected frontends update

This is a small, safe change — the cleanup logic already exists in processor.ts:402-417, it just needs to be callable from a recovery path at startup.
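
The steps above could be sketched roughly as follows. Everything here is hypothetical: the in-memory array stands in for the database query, `Publish` stands in for the Bus, and the `message.updated` event name is a placeholder rather than a confirmed OpenCode event.

```typescript
// Hypothetical shapes and names throughout; a sketch, not OpenCode's code.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
  error?: string;
}

interface AssistantMessage {
  id: string;
  role: "assistant";
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// Stand-in for the Bus: any callback that notifies connected frontends.
type Publish = (event: { type: "message.updated"; messageID: string }) => void;

const INTERRUPTED = "Tool execution was interrupted (server restart)";

// Intended to run once at startup, before any session loop begins, so every
// incomplete assistant message is necessarily left over from a previous process.
function recoverOrphans(
  messages: AssistantMessage[],
  publish: Publish,
  now: number = Date.now(),
): number {
  let recovered = 0;
  for (const msg of messages) {
    if (msg.time.completed !== undefined) continue; // already terminal
    for (const part of msg.parts) {
      if (part.status === "pending" || part.status === "running") {
        part.status = "error";
        part.error = INTERRUPTED;
      }
    }
    msg.time.completed = now;
    publish({ type: "message.updated", messageID: msg.id });
    recovered++;
  }
  return recovered;
}

// Usage with an in-memory stand-in for the message table.
const db: AssistantMessage[] = [
  {
    id: "msg_done",
    role: "assistant",
    time: { created: 1, completed: 5 },
    parts: [{ id: "prt_a", status: "completed" }],
  },
  {
    id: "msg_orphan",
    role: "assistant",
    time: { created: 6 },
    parts: [
      { id: "prt_b", status: "running" },
      { id: "prt_c", status: "pending" },
    ],
  },
];

const events: string[] = [];
const count = recoverOrphans(db, (e) => { events.push(e.messageID); }, 1000);

console.log(count); // 1
console.log(events.join(",")); // msg_orphan
```

Running the recovery a second time is a no-op, since every message it touched now has time.completed set. The timing matters: the scan is only safe before any session starts processing, otherwise a live message would be indistinguishable from an orphaned one.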

Steps to reproduce

  1. Start opencode serve
  2. Start a session that uses tool calls (e.g., ask it to run tests)
  3. Kill the server process while tools are executing (kill -9)
  4. Restart the server
  5. Open the session in the UI: it shows a permanent "Thinking" spinner
  6. The session status API returns {} (idle), but the UI is stuck

Environment

  • opencode serve (long-running, multiple sessions)
  • macOS / Linux
  • Any provider (observed with gpt-5.3-codex via github-copilot)

OpenCode version

Latest dev branch (commit 814a515a8)
