fix: HTTP transport + RPC endpoint to prevent bridge zombie accumulation#1813
fix: HTTP transport + RPC endpoint to prevent bridge zombie accumulation#1813fayenix wants to merge 6 commits into
Conversation
- Add POST /api/v1/rpc (JSON-RPC-over-HTTP) for out-of-process transport - Add MemosHttpClient class (Python HTTP client, drop-in for MemosBridgeClient) - Provider prefers HTTP when daemon already running on :18800 - Add kill_zombie_bridges() and probe_viewer_status() to daemon_manager - Exempt /api/v1/rpc from session cookie auth (loopback-only) - Remove tracked ephemeral files (daemon/bridge.pid, memos-local/*) - Update .gitignore for runtime state
Memtensor-AI
left a comment
There was a problem hiding this comment.
Needs changes — good architecture, a few issues to address
The three-layer approach is sound and the PR description is excellent. Here are the specific issues:
Security — Auth bypass is too broad
auth.ts — Exempting /api/v1/rpc from session auth entirely is risky. This endpoint gives full access to every dispatched method (turn.start, subagent.record, etc.). Even on loopback, any local process can now invoke arbitrary RPC methods without authentication.
if (pathname === "/api/v1/rpc") return true;Consider restricting to loopback source IP check inside this condition, or requiring the API key for RPC calls. A malicious local process (browser extension, compromised npm script) could abuse this.
Bug — Notification handling returns undefined into a batch array
rpc.ts line ~107 — handleSingle returns undefined for notifications, but in batch mode those undefined values end up in the results array and get serialized as null:
const results = await Promise.all(
parsed.map((item) => handleSingle(item, dispatch)),
);
return results; // contains undefined entries for notificationsPer JSON-RPC 2.0 spec, notifications must not produce response objects. Filter them out:
return results.filter((r) => r !== undefined);Also handle the case where all items are notifications (return 204 or empty).
Bug — kill_zombie_bridges is Linux-only
daemon_manager.py — ss -tlnp doesn't exist on macOS. If this project runs on macOS dev machines, this will silently fail (which is fine) but then ps aux parsing will kill all bridge.cjs processes including the daemon since daemon_pid will be None:
if daemon_pid is not None and pid == daemon_pid:
continue
# If daemon_pid is None, we skip nothing → kill the daemon tooFix: if daemon_pid is None, skip killing entirely or use lsof -iTCP:18800 -sTCP:LISTEN as a macOS fallback.
Minor issues
-
bridge_client.py—MemosHttpClientstores_host_handlersbut they're never invoked. The docstring mentions this, butregister_host_handlersilently swallows the registration. Ifhost.llm.completeis actually needed for certain methods, the HTTP path will silently return degraded results. Consider logging a warning on first call that requires it. -
.gitignore— Adding baredata/anddataat the end is redundant (the first covers the second). Alsodatawithout a trailing slash will match files nameddataanywhere. Likely intentional but worth confirming. -
__init__.py— Thehttp_bridgevariable is used after theexceptblock where it might not be assigned ifMemosHttpClient()constructor itself throws:
http_bridge: MemosHttpClient | None = MemosHttpClient() # can throw
# ...
except Exception as err:
http_bridge.close() # type: ignore — could be unboundInitialize http_bridge = None before the try block and guard the close.
Nit
The import re inside kill_zombie_bridges should be at module level per convention (it's already a stdlib import with no cost).
Overall the design is solid — HTTP-first with stdio fallback is the right call, and the zombie cleanup addresses a real operational pain point. The auth bypass and the "kill everything when daemon_pid is None" bug are the two I'd want fixed before merge.
- auth.ts: add reverse-proxy assumption comment to loopback RPC exemption - rpc.ts: filter undefined from batch notifications (JSON-RPC 2.0 §6); return 204 for all-notification batches - __init__.py: extract _connect_http_bridge() helper; fix unbound http_bridge on constructor exception; conditional cold-start sleep via startup_lock_active() - daemon_manager.py: add macOS lsof fallback; skip killing when daemon PID unidentified; import re to module level - .gitignore: remove redundant bare data entry
- isHermesChatRunning() pgrep pattern now matches 'gateway run' in addition to 'hermes chat', using regex 'hermes.*(chat|gateway)'. Previously the daemon could not detect the gateway process since it runs as 'hermes_cli.main gateway run', causing bridge status to permanently show 'disconnected' under gateway mode. - Daemon path now starts the bridge heartbeat (markConnected + startHeartbeat), matching the stdio bridge path. Without this, the daemon's bridge status was initialized to 'disconnected' and never promoted. With the heartbeat, it shows 'connected' and the stale-rule correctly transitions to 'reconnecting' between stdio bridge sessions.
After ensure_viewer_daemon() succeeds the adapter was immediately spawning a stdio bridge without checking whether the daemon it just started is now reachable over HTTP. This created a redundant stdio process alongside the HTTP daemon and blocked startup for up to 60s waiting for the stdio cold-start. Three changes in __init__.py: - initialize(): re-probe after ensure_viewer_daemon() and connect via HTTP if the daemon is ready; only fall through to stdio when HTTP fails or the daemon is not up - initialize(): unconditionally sleep 0.5s + re-probe on "free" status (replaces the startup_lock_active() conditional that missed the same-process startup case) - _reconnect_bridge(): same post-ensure_viewer_daemon() re-probe, so reconnects also prefer HTTP over stdio Removes the now-unused startup_lock_active import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
The Hermes MemOS adapter spawns a new
bridge.cjs --no-viewersubprocess for everyAIAgentsession. Over time, 5-16 bridge processes accumulate — only one binds port 18800, the rest are zombies burning CPU and LLM tokens on redundant scoring loops against the same DB.Root cause:
MemTensorProvider.initialize()always callssubprocess.Popen()to spawn a stdio bridge. The daemon HTTP server exists at port 18800 but the Python adapter never connects to it — it always spawns its own process.Fix (Three Layers)
Layer 1 — Zombie cleanup
daemon_manager.py: On startup, enumerate bridge processes and kill any that aren't the port-holding daemon.Layer 2 — HTTP client
New
MemosHttpClientclass inbridge_client.pythat sends JSON-RPC tohttp://127.0.0.1:18800/api/v1/rpcinstead of spawning a subprocess.MemTensorProviderprefers HTTP when the daemon is running.Layer 3 — JSON-RPC endpoint
New
POST /api/v1/rpcroute that accepts JSON-RPC 2.0 envelopes and dispatches to the same handlers the stdio server uses. Gives immediate feature parity for all methods (turn.start,turn.end,subagent.record, etc.) with zero per-method route work.Files Changed
adapters/hermes/memos_provider/bridge_client.py—MemosHttpClientclassadapters/hermes/memos_provider/daemon_manager.py— zombie cleanup + daemon probeadapters/hermes/memos_provider/__init__.py— prefer HTTP when daemon runningapps/memos-local-plugin/server/routes/rpc.ts— new JSON-RPC endpointapps/memos-local-plugin/server/routes/registry.ts— register RPC routesapps/memos-local-plugin/server/routes/auth.ts— exempt loopback RPC from authTesting
All tested on a live Hermes+MemOS instance with 36K traces:
core.health✅turn.start✅ (returns episode + retrieval)skill.list✅ (27 skills)subagent.record✅ (returns traceId + episodeId)Related