feat: #164 sponsored ERC-4337 register + v2-demo harness restructure #200

Merged
hanwencheng merged 14 commits into main from claude/trusting-montalcini-51bff9
Jun 5, 2026

@hanwencheng
Member

Summary

Two strands developed together on this branch:

  1. #164 (Migrate master authority to an ERC-4337 P-256 smart-account; resolves §11 gating findings): broker-sponsored ERC-4337 master register, i.e. gas-free passkey master onboarding (E6/E7).
  2. v2-demo harness restructure + runbook fixes — the operator / sandbox / CI demo flow.

Broker (deploy-relevant)

  • New crates/agentkeys-broker-server/src/sponsor.rs (Stage A): the verifiable encoding + co-sign core for a VerifyingPaymaster-sponsored UserOp — pure functions, no chain client. user_op_hash ≡ EntryPoint.getUserOpHash, paymaster_get_hash ≡ VerifyingPaymaster.getHash, and broker_cosign recovers to brokerSigner. The broker EIP-191-co-signs the paymaster getHash (the Sybil gate) only for an authenticated J1 session.
  • lib.rs: +pub mod sponsor;. No broker route/handler changed — the running broker's HTTP behavior is unchanged; the module is exported for the sponsored-register flow but not yet wired to a live endpoint.
  • Stage B (EntryPoint.handleOps submission — needs an EVM client) is a follow-up, not in this PR.

CLI / daemon

  • CLI: k11 webauthn passkey keygen/sign + the sponsored-register flow.
  • Daemon: ui_bridge wires the flow into the desktop UI.

Harness + docs

  • harness/v2-demo.sh: single 5-phase front door (1-3 stages, 4 memory-plant, 5 wire) with PHASE.STEP addressing (--from 4.1, --only 3.11). Sandbox auto-detect now probes the aiosandbox HTTP API ($SANDBOX_URL/healthz/v1/sandbox) instead of a local openviking install.
  • v2-stage3: agent-side steps (11-12 / 14-15) defer to the sandbox on the operator (reported green, never a failure); the mock agent is CI-only. Clearer stale-broker guidance on the #195 master-self step (feat(broker,worker): skip scope check for master-self, operator==actor).
  • v2-stage1/2: Touch-ID WebAuthn by default for operators, stub for CI.
  • New harness/CLAUDE.md (harness rules extracted from root CLAUDE.md), operator runbooks (operator-runbook-harness.md, operator-runbook-web-memory.md), and erc4337 register/fund helpers.

Deploy note

The step-16 (#195 master-self scope skip) fix is already on origin/main (commit 5bd3bf0) — the prod broker just needs a redeploy (bash scripts/setup-broker-host.sh --ref main on the broker host). This PR adds the dormant sponsor module on top; redeploying after merge picks it up with no behavior change to existing routes.

Verification

  • cargo check (broker + cli + daemon): ✅
  • cargo test -p agentkeys-broker-server: ✅ 7 passed (SES integration test ignored — needs live AWS)
  • bash -n on all touched harness scripts: ✅

🤖 Generated with Claude Code

…s restructure

Broker-sponsored, gas-free ERC-4337 master onboarding (#164 E6/E7): new broker 'sponsor' module — verifiable UserOp + VerifyingPaymaster.getHash encoding with an EIP-191 broker co-sign (pure functions, byte-exact with the live contracts, zero-gas read-only verification); lib.rs exports it. CLI gains k11 webauthn passkey keygen/sign + the sponsored-register flow; daemon ui_bridge wires the flow into the desktop UI.

Harness + docs: v2-demo.sh restructured into a single 5-phase front door (1-3 stages, 4 memory-plant, 5 wire) with PHASE.STEP addressing; sandbox auto-detect now probes the aiosandbox HTTP API (not a local openviking install). v2-stage3 agent-side steps (11-12/14-15) DEFER to the sandbox on the operator (green) with mock reserved for CI; clearer stale-broker guidance on the #195 master-self step. stage1/2 default to Touch-ID WebAuthn for operators, stub for CI. Adds harness/CLAUDE.md (harness rules extracted from root CLAUDE.md), operator runbooks (harness, web-memory), and erc4337 register/fund helpers.
… don't trip a stale Cargo.lock

setup-broker-host.sh --ref did 'git checkout -f' + 'git pull --ff-only', which can no-op against a stale local branch tip or leave a build-modified Cargo.lock on disk. A subsequent 'cargo build --locked' then fails with 'cannot update the lock file'. A deploy target must match origin EXACTLY — replace the ff-only pull with 'git reset --hard origin/$PULL_REF' (HEAD + index + working tree, Cargo.lock included). Idempotent.
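
The reset-to-origin approach described above can be sketched as a small shell function. This is a minimal illustration, not the script's actual code: the function name `sync_to_origin` and its two positional parameters are assumptions, and only the `git` subcommands themselves come from the commit message.

```bash
# Hypothetical helper: make the checkout in $1 match origin/$2 exactly
# (HEAD, index, and working tree, a build-modified Cargo.lock included).
# Running it twice in a row leaves the tree unchanged the second time.
sync_to_origin() {
  local repo=$1 ref=$2
  git -C "$repo" fetch --quiet origin
  git -C "$repo" checkout --quiet -f "$ref"
  git -C "$repo" reset --quiet --hard "origin/$ref"
}
```

Unlike a fast-forward-only pull, the hard reset discards local edits to tracked files, so it belongs on a deploy target and not on a working checkout with uncommitted changes.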
…nt -> success)

Closes the gap where only the NEGATIVE/skip scope paths were asserted. Worker verify.rs: two new unit tests with an in-process std::net JSON-RPC mock (no new dep, Cargo.lock untouched) — check_chain_scope_ok_when_chain_grants (operator!=actor, chain returns true -> Ok) and check_chain_scope_rejects_when_chain_denies (false -> NotInScope). Harness v2-stage3: new standalone step 18 'POSITIVE: granted agent (operator!=actor) mints memory cap for the GRANTED service -> 200' — extracts the scope-grant assertion out of the steps 11-12 roundtrip; operator-authenticated mint (no agent key), defers on a §10.2 agent whose device isn't paired yet, mocks on CI. Completes the scope triad with step 16 (master-self skip) + step 17 (cross-actor denied). Cleanup renumbered 18->19, STEP_TOTAL=19. Runbook updated.
…{evm_address} (HTTP 422)

Phase 4 step 2 POSTed {evm_address:...} to /v1/auth/wallet/start, but the broker's WalletStartRequest requires {address: String, chain_id: u64} (both mandatory) — so axum rejected it with 422 'missing field address'. Reproduced live: {evm_address} -> 422, {address,chain_id} -> 200. Aligns memory-plant with the broker contract + the shape stage-1/stage-3/web-memory-bootstrap already use. Broker was correct; this was a stale client field name.
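
A sketch of the corrected request body, assuming a hypothetical helper named `wallet_start_body`. Only the two field names (`address`, `chain_id`) come from the broker contract described above; the sample address and chain id are placeholders.

```bash
# Hypothetical helper: build the /v1/auth/wallet/start body with both
# mandatory fields. chain_id is emitted as a bare JSON number, matching
# the u64 the broker expects.
wallet_start_body() {
  local address=$1 chain_id=$2
  printf '{"address":"%s","chain_id":%d}' "$address" "$chain_id"
}

# Placeholder values, for illustration only.
wallet_start_body "0x0000000000000000000000000000000000000001" 1
echo
```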
…P 400 malformed address)

DEPLOYER_ADDR is already 0x-prefixed (cast wallet address output), but step 2 prepended another 0x -> '0x0x941cb1…' -> broker 400 'malformed address'. The wrong field name (422, prior fix) had masked this. Reproduced live: 0x0x… -> 400, 0x… -> 200. The omni (line 74) + cap-mint (line 98) already use the correct forms (broker hashes agentkeysevm+0x-addr, verified in omni_account.rs). Also switch the wallet start/verify curls from -sSf to -sS --fail-with-body so the broker's error JSON is shown on 4xx instead of an opaque 'curl: (NN) … error: CODE' (this step hid its cause twice).
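
One way to guard against the double prefix is to normalize the address before building the request. The sketch below is illustrative: `ensure_0x` and `wallet_start` are hypothetical names, and the broker URL argument is an assumption. `--fail-with-body` is a real curl option (available from curl 7.76 on) that still exits non-zero on 4xx/5xx but prints the response body.

```bash
# Hypothetical helper: return the address with exactly one 0x prefix,
# whether the input is bare, already prefixed, or double-prefixed.
ensure_0x() {
  local addr=$1
  while [ "${addr#0x}" != "$addr" ]; do
    addr=${addr#0x}
  done
  printf '0x%s' "$addr"
}

# Hypothetical wrapper (not executed here): $1 is the broker base URL,
# $2 the JSON body. The broker's error JSON is shown on a 4xx.
wallet_start() {
  curl -sS --fail-with-body \
    -H 'Content-Type: application/json' \
    -d "$2" "$1/v1/auth/wallet/start"
}
```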
[high] Bash 3.2 (memory-plant-demo.sh): dropped the 'declare -A CAP' associative array (bash 4+; the operator platform is macOS bash 3.2.57 where it errors and CAP[$ns] under set -u is an unbound arithmetic var). Step 3 now just proves cap-mint per namespace; step 4 re-mints fresh (short-TTL). Verified runnable under 3.2.
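
The Bash 3.2 finding above can be illustrated with a loop that never stores caps in an array. This is a sketch under stated assumptions: `mint_cap` is a stand-in stub for the real cap-mint call, and the namespace list is illustrative (only `travel` appears elsewhere in this PR).

```bash
# Stand-in for the real cap-mint call; echoes a fake token per namespace.
mint_cap() { printf 'cap-for-%s' "$1"; }

NAMESPACES="travel work health"   # illustrative namespace names

# Step 3: prove cap-mint works for each namespace; nothing is stored,
# so no 'declare -A' (bash 4+) is needed.
for ns in $NAMESPACES; do
  cap=$(mint_cap "$ns") || { echo "cap-mint failed for $ns" >&2; exit 1; }
  [ -n "$cap" ] && echo "ok: $ns"
done

# Step 4: re-mint right before use instead of looking up CAP[$ns].
plant_namespace() {
  local ns=$1 cap
  cap=$(mint_cap "$ns")
  echo "plant $ns with $cap"
}
```

Re-minting costs one extra call per namespace, but with short-TTL caps a stored value could have expired by step 4 anyway.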

[high] Partial plant (daemon ui_bridge.rs): the real-chain plant only failed when ZERO entries planted, so a partial (some namespaces succeed, one fails) returned 200 + audit + updated state. Now ANY durable-write failure returns 502 before the success audit/response; succeeded writes stay in master_memory so a re-plant is idempotent and resumes. 35 ui_bridge tests still pass.

[medium] Phase 5 skip (v2-demo.sh): an auto-skip (no aiosandbox) returned 0 and printed 'all green' + 'agent paired' — an unexecuted proof read as a pass. Now run_wire_phase records WIRE_RESULT (wired/skipped/disabled); an auto-skip reports 'v2-demo INCOMPLETE' and exits non-zero, the loop shows the phase as SKIPPED (not green ok), and the final pass/paired text only prints when the wire actually ran. --wire none is the explicit clean-skip escape (CI uses it). Also fixed the skip hint to the correct aiosandbox 'docker run …' (was the wrong openviking-sandbox-setup.sh). Runbook + harness/CLAUDE.md synced.
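
The three wire outcomes named in the Phase 5 finding map naturally onto a summary line and an exit status. The sketch below assumes a hypothetical function name, `report_wire_result`, and simplified messages; only the `wired`/`skipped`/`disabled` values and the 'v2-demo INCOMPLETE' text come from the commit message.

```bash
# Hypothetical helper: turn the recorded WIRE_RESULT into a summary line
# and an exit status. Only an auto-skip is treated as a failure.
report_wire_result() {
  case "$1" in
    wired)    echo "all green"; return 0 ;;
    disabled) echo "wire disabled (--wire none)"; return 0 ;;
    skipped)  echo "v2-demo INCOMPLETE"; return 1 ;;
    *)        echo "unknown WIRE_RESULT: $1" >&2; return 2 ;;
  esac
}
```

Keeping `disabled` distinct from `skipped` is what lets CI opt out explicitly while an operator run without a sandbox still fails loudly.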
…eferred) + real memory plant

dev.sh launched 'agentkeys-daemon --ui-bridge' with NO --register-master-script, so finish_chain_register hit its 'register_master_script = None' branch and silently SKIPPED the on-chain registerFirstMasterDevice (chain: none) — the ceremony was deferred while K11 enroll still reported success. It also passed no --memory-url/--memory-role-arn, so the plant button fell back to the in-memory RwLock instead of cap-mint → STS → worker → S3.

dev.sh now sources scripts/operator-workstation.env and ALWAYS passes --register-master-script (in-repo heima-register-first-master.sh; a missing deployer key / chain config now surfaces chain_error, never a silent skip), plus --memory-url/--memory-role-arn/--region when the env supplies them (real plant; logged). The daemon↔script arg contract was already correct (--operator-omni/--actor-omni/--k11-cose-hex/--k11-cred-id/--rp-id-hash) and real_memory_ctx sources the device hash from the K11-finish register, so the un-deferred ceremony is exactly what feeds the plant.

Name drift called out: the daemon's --memory-url env is AGENTKEYS_MEMORY_URL but operator-workstation.env spells it AGENTKEYS_WORKER_MEMORY_URL; bridged in dev.sh via the explicit flag (accepts either). Also un-stale the ui_bridge.rs module doc that still claimed the register is stubbed.
…e agent's memory scope)

Phase 5 ran 'phase1-wire-demo.sh --real' with NO --webauthn, so the wire's P.3 scope grant (heima-scope-set --webauthn) was SKIPPED — the §10.2 agent paired but the master never granted it the memory:<ns> scope. The agent's memory.get(travel) mints a cap for service 'memory:travel' (mcp-server/src/tools/memory.rs: format!("memory:{namespace}")), the broker checks isServiceInScope(O_master, agent, memory:travel) and returns service_not_in_scope -> Act1 (3.1) + inject (4.2) fail. Now auto + real pass --real --webauthn so the master grants memory:<ns> via Touch ID (one prompt, like phases 1-2; heima-scope-set is idempotent so re-runs skip). Service strings match (grant + cap both memory:<ns>); the master's K11 is enrolled+registered in phases 1-2 so setScopeWithWebauthn verifies. Runbook + harness/CLAUDE.md synced.
…nnect a daemon' toast

The /memory plant button only renders when the daemon is connected (status.kind==='connected'), so a plant failure is almost never 'no daemon' — yet plantDone's else-branch always showed 'Connect a daemon to plant prepared memory', masking the daemon's actual reason (which postJson already captured in r.status.detail, e.g. 409 'no master session — complete onboarding first' / 'master device not registered on chain yet', or a 502 worker error). Now it extracts + shows the real reason. tsc --noEmit clean.
…lant 400)

memory_put_real/memory_get_real send operator_omni + actor_omni = ctx.omni, sourced from the onboarding session omni which is stored BARE (no 0x). The broker cap-mint input-validates that operator_omni starts with 0x and 400s ('operator_omni must start with 0x') before normalizing — so the web plant failed AFTER the device was registered. Normalize ctx.omni to 0x once in real_memory_ctx (covers put + get); the broker normalize_hex32's it for the device-binding match, and master-self (operator==actor) hits the #195 skip so no scope grant is needed. cargo check -p agentkeys-daemon clean.
…s-store gap)

After a successful plant, plantDone read listMasterMemory but only setMemories on ok and toasted just the plant counts — so a daemon-cache miss (e.g. after a restart) silently showed an empty list. Now the toast shows '<planted> new … <list.length> in the memory view' (so 'N new but 0 in view' is visible) and surfaces a failed list GET. Note: GET /v1/master/memory reads the daemon IN-MEMORY cache, not S3 — so the list is empty after any daemon restart even though the data is durable in S3.
… (Phase 0)

Phase 0 of docs/plan/web-flow/config-data-class-memory-list.md (lazy, config-driven memory list). Adds the DataClass::Config variant to both cap.rs + verify.rs (serializes 'config'), the broker cap_config_store/cap_config_fetch handlers (statically derive {op, data_class: Config}) + routes /v1/cap/config-store + /v1/cap/config-fetch. check_data_class is generic, so a Config cap is rejected by the cred + memory workers (and a memory cap by the config worker) — covered by new unit tests. Infra-free: the endpoints mint Config caps, but the config bucket/role/worker land in Phases 1-2. cargo check (broker) clean; 4 worker data_class tests pass.
harness-ci 'cargo fmt + clippy + test' failed at fmt: this PR's sponsor/webauthn/cli/daemon/verify code (committed earlier) wasn't rustfmt-clean, plus my new config routes in lib.rs. Ran cargo fmt --all (6 PR files reformatted, no unrelated drift). Also fixed clippy::unusual_byte_groupings in sponsor.rs:189 (0x0102_03 -> 0x010203, value-identical) that -D warnings rejected in the lib-test target. Verified locally: fmt --check clean, clippy --workspace --all-targets -- -D warnings exit 0, cargo test --workspace 40 results ok / 0 failed.
… seed seam (W6)

Implements wire-real-paths W6 as a v2-demo PHASE, not a standalone script (no second front door, no re-bootstrap). daemon: add --ui-bridge-seed-session-jwt + --ui-bridge-seed-omni — seeds the ui-bridge onboarding session with the master's existing J1 + omni so the parity phase drives the REAL plant chain WITHOUT re-running interactive email/WebAuthn onboarding (pairs with the existing --master-device-key-hash). harness/web-parity-demo.sh = phase 6: reuses the preflight build + live chain/broker + the master registered in phases 1-2, boots agentkeys-daemon --ui-bridge SEEDED, plants a probe ns via POST /v1/master/memory/plant; a 200 proves the daemon's chain (cap-mint→STS→worker→S3) == the agent/harness chain — the web↔harness drift gate. Cost: one daemon boot + one plant, no re-build/re-chain/re-enroll; real-only (skips without a broker). Wired into v2-demo (default 1→6, --from/--stage/--only addressing). Docs synced (runbook, harness/CLAUDE.md, wire-real-paths W6). cargo fmt+clippy --workspace --all-targets clean; bash -n clean. NOTE: statically verified (compiles + wired + prereqs met); the live end-to-end smoke is bash harness/v2-demo.sh --stage 6 on real infra.
hanwencheng merged commit 6d916e7 into main on Jun 5, 2026
8 checks passed
hanwencheng added a commit that referenced this pull request Jun 5, 2026
…rker chain

The broker/worker HTTP chain was hand-coded in three places (MCP HttpBackend,
daemon ui_bridge, harness bash), the structural cause of the #200 drift bugs
(evm_address vs {address,chain_id}, bare-vs-0x omni, per-namespace field
shapes). Collapse it behind one crate so drift is a COMPILE error (Rust callers
share the types) or a FIXTURE mismatch (the harness gate), not a runtime 4xx.

New crate agentkeys-backend-client (the dual of broker-server / worker-*):
- protocol.rs: every cap-mint / worker / audit wire shape, the memory:<ns>
  service builder, and the 0x-omni normalizer (the daemon's old inline bug site)
- client.rs: BackendClient — cap-mint (4 data-class endpoints) -> STS relay ->
  worker put/get -> audit append (the reference impl lifted out of HttpBackend)
- fixtures.rs + dump-protocol-fixtures bin: canonical fixtures serialized from
  the serde types + frozen key-set pins

Collapse the duplicates (net -355 LOC in existing files):
- MCP HttpBackend -> thin delegate over BackendClient; backend wire-shape
  submodules (broker/memory/audit) deleted, re-exported from the crate so the
  Backend trait + InMemoryBackend + tools keep their crate::backend::* paths
- daemon memory_put_real / real_memory_ctx -> call the shared client (kills the
  duplicate cap-mint body + the inline 0x-normalize where the bugs lived)

Enforce (fold-systemic-fixes-into-enforcement):
- scripts/check-backend-fixture-drift.sh: diffs every # @backend-fixture-
  annotated bash body against the crate-emitted fixtures (catches add/rename/drop)
- harness-ci.yml rust-checks runs the fixture --check + the bash gate on every PR
  touching crates/**, harness/**, scripts/**
- root CLAUDE.md + harness/CLAUDE.md "broker/worker shapes have ONE owner" rule;
  arch.md component inventory updated
hanwencheng added a commit that referenced this pull request Jun 6, 2026
…y list + lazy detail + codex hardening) (#205)

* feat(worker-config): #201 Config data-class substrate — infra + worker + isolation tests (Phases 1-3)

Stand up the DataClass::Config substrate end-to-end (Phases 1-3 of the
config-driven memory-list plan; Phase 0 cap layer landed in #200). The
visible daemon/frontend behavior (Phases 4-5) is a follow-up, gated on the
operator deploying this (per the issue's dependency chain 4 -> 0,1,2).

Phase 1 — infra (idempotent mirrors of the memory scripts):
- scripts/provision-config-bucket.sh, provision-config-role.sh,
  apply-config-bucket-policy.sh (config/ prefix, own bucket + role per arch.md
  §17.2; split-statement v3 bucket policy)
- CONFIG_BUCKET / CONFIG_ROLE_ARN + config worker host/URL in
  operator-workstation.env; wired into setup-cloud.sh step 13

Phase 2 — config worker (master-only):
- new agentkeys-worker-config crate (mirror of agentkeys-worker-memory; config/
  S3 prefix, $CONFIG_BUCKET, AGENTKEYS_CONFIG_KEK_HEX, DataClass::Config, :9096)
- full setup-broker-host.sh wiring (build/install/env/systemd/nginx/firewall/
  certbot/post-install summary)

Phase 3 — isolation tests (test-discipline rule):
- harness/v2-stage3-demo.sh steps 19-21: config layer-3/4 (own-prefix write OK +
  cross-bucket AccessDenied) + cap data-class-mismatch (config<->memory,
  config<->cred). All master-self -> run on the operator, no sandbox defer;
  skip cleanly until the operator provisions config infra + redeploys the broker.

Source-of-truth updates: arch.md (§5 canonical names, §17.2/.3/.5, four-layer
table, storage diagram), CLAUDE.md (per-data-class table + six cap endpoints +
'third data class landed'), operator-runbook-harness.md, harness/CLAUDE.md, plan doc.

Verified: config worker dev+release build + unit tests green; cargo check --workspace
clean (all 17 crates); all bash scripts syntax-clean.

* fix(harness): graceful skip (not die) when a cross-class worker is unreachable

post_cross_class folded curl's stderr into the returned code via 2>&1, so an
UNDEPLOYED worker (e.g. config.litentry.org before the broker redeploy) yielded
rc="curl: (35) SSL_ERROR_SYSCALL...\n000000" instead of a clean "000". That no
longer matched master_cross_class_rejection's 000|502|503|504) case and fell
through to die — turning the intended graceful prereq_missing
(config-worker-unreachable) at stage-3 step 21 into a hard failure.

Send curl's transport error to a side file so rc is just the 3-digit %{http_code}
(000 on transport failure), and surface that error as the body for diagnostics.
Also hardens steps 14-15 (same helper) — clean rc + diagnostic body.

Verified: repro against the unreachable config.litentry.org returns clean 000 ->
prereq_missing fires; bash -n clean.
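
A minimal sketch of the side-file technique, assuming hypothetical names `http_code_only` and `classify_rc`. It is simplified relative to the helper described above: the response body is discarded here, and every code outside the unreachable set is lumped into a single `other` bucket.

```bash
# Hypothetical helper: print only the 3-digit HTTP code. curl's transport
# error text goes to the side file $2, so it cannot pollute the code.
http_code_only() {
  local url=$1 errfile=$2 rc
  rc=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$url" 2>"$errfile") || true
  printf '%s' "${rc:-000}"
}

# Classify the code: an unreachable worker is a missing prerequisite,
# not a hard failure.
classify_rc() {
  case "$1" in
    000|502|503|504) echo "prereq_missing" ;;
    *)               echo "other" ;;
  esac
}
```

With stderr folded in via `2>&1`, the value would contain the curl error text as well, and the `case` pattern above would no longer match.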

* fix(infra): #201 wire config worker into DNS + worker verify (config.litentry.org A record)

The config worker host was added to operator-workstation.env but NOT to the two
DNS provisioning paths nor the worker health-check, so config.litentry.org never
got an A record → unreachable (the stage-3 step-21 SSL_ERROR_SYSCALL).

Add WORKER_CONFIG_HOST everywhere the four original workers are enumerated:
- scripts/setup-cloud.sh do_step_6 — the PRIMARY DNS path (its own change-batch,
  not a delegate): + config A record + env validation (8 A records / 14 UPSERTs).
- scripts/dns-upsert-workers.sh — the standalone re-UPSERT path: + config in the
  sanity loop, change-batch, plan printout, DoH verify loop, and certbot next-steps.
- scripts/verify-workers.sh — + config:/healthz ("ok":true), All 5 workers green.
- operator-workstation.env — comment now says five workers incl. config.

Verified: bash -n clean on all three; setup-cloud change-batch builds 14 records;
dns-upsert change-batch valid JSON.

* refactor(infra): #201 setup-cloud delegates worker DNS to dns-upsert-workers.sh (single source of truth)

Wire the config-worker setup fully into the idempotent orchestrator so nobody
runs DNS by hand, and kill the dual-maintenance drift that left config.litentry.org
without an A record (two hardcoded worker lists: setup-cloud step 6 + dns-upsert).

- dns-upsert-workers.sh: new --no-verify (UPSERT then exit, skipping the INSYNC/DoH
  wait + operator next-steps printout) for orchestrator use.
- setup-cloud.sh step 6: keep DKIM/MX/TXT + broker/signer/mcp inline (9 records);
  DELEGATE the 5 service-worker A records (audit/email/cred/memory/config) to
  dns-upsert-workers.sh --eip $EIP --no-verify (honors --dry-run + the same
  ENV_FILE so the prod/test split carries through). One source of truth → a new
  worker can never again be added to one list but not the other.
- The 3 config provision scripts were already delegated in step 13 (no change).
- cloud-bootstrap.md: config.litentry.org added to the certbot recipe (+ explicit
  one-shot form), the --config-host flag, the DNS A-record list, the worker-subdomain
  table, the per-worker env-file glob, the build/nginx/test-subdomain references.

Verified: bash -n clean on all three; setup-cloud inline batch builds 9 records;
dns-upsert --no-verify parses + early-exits; cloud-bootstrap certbot loop includes CONFIG_HOST.

* perf(deploy): #201 sccache compiler cache in setup-broker-host (fast re-deploys + branch switches)

The broker host redeploys often and switches branches via --ref. cargo already
caches deps in $REPO_ROOT/target (we never clean on the happy path), but a
git checkout -f rewrites changed files' mtimes → cargo re-fingerprints + rebuilds
them, and a cold/wiped target/ recompiles the whole aws-sdk/tokio tree.

Add sccache — a CONTENT-addressed compiler cache keyed on each crate's actual
inputs (not mtime/branch/target state), persisted in $SCCACHE_DIR independent of
target/. Identical inputs hit the cache regardless of branch or a cold target/.

- setup_build_cache(): installs sccache (prebuilt musl binary, arch-detected →
  cargo install fallback → skip), exports RUSTC_WRAPPER + SCCACHE_DIR, starts the
  server. Best-effort + idempotent + NON-FATAL (deploy proceeds with plain cargo
  if install fails). Opt out: AGENTKEYS_NO_SCCACHE=1; pin: SCCACHE_VERSION=vX.Y.Z.
- Prints 'sccache stats' after the worker build — visible proof (re-deploys =
  mostly cache hits).
- cloud-bootstrap.md documents the cache + the opt-out.

Verified: bash -n clean.

Note: this does NOT change what gets built; my earlier #201 commits were all
shell/docs (zero Rust), so a re-run that only pulls them recompiles nothing.
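
The non-fatal, opt-out behavior of the cache setup can be sketched as follows. This is not the script's actual `setup_build_cache()`: the install step is omitted, the messages are invented, and the `SCCACHE_DIR` default shown is an assumption. `RUSTC_WRAPPER`, `SCCACHE_DIR`, and `sccache --start-server` are real cargo/sccache interfaces.

```bash
# Sketch: enable sccache when it is already installed; never fail the deploy.
setup_build_cache() {
  if [ "${AGENTKEYS_NO_SCCACHE:-0}" = "1" ]; then
    echo "sccache disabled by AGENTKEYS_NO_SCCACHE=1"
    return 0
  fi
  if ! command -v sccache >/dev/null 2>&1; then
    echo "sccache not installed; building with plain cargo"
    return 0
  fi
  export RUSTC_WRAPPER=sccache
  export SCCACHE_DIR="${SCCACHE_DIR:-$HOME/.cache/sccache}"  # assumed default
  sccache --start-server >/dev/null 2>&1 || true  # already running is fine
}
```

After a build, `sccache --show-stats` reports hit and miss counts, which is the kind of visible proof the commit message mentions.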

* docs(cloud-bootstrap): #201 cert issuance MUST run on the broker (not a VPN'd laptop) + ACME pre-check

Operator hit a certbot 'unauthorized … 404' on config.litentry.org because
certbot --webroot was run on a local box (behind a VPN): the challenge file
landed there, but Let's Encrypt validates against the hostname's PUBLIC IP = the
broker, which had no such file. The nginx 1.28.3 (VPN proxy) vs 1.24.0 (broker)
version split in the 404 pages was the tell.

Fold-back to §5b so the next operator can't repeat it:
- Loud '⚠️ run EVERY command ON THE BROKER HOST' callout explaining the
  --webroot-writes-local vs CA-validates-public-IP mechanism + the WARP/Zscaler
  interception trap (laptop curl of <host> hits the VPN's nginx, not the broker).
- A cheap local ACME pre-check (nginx reload + probe file + curl localhost with
  Host header) BEFORE the certbot loop — a freshly-added worker (config) needs a
  reload; 'nginx -T' showing the vhost does NOT mean the running process loaded it.
- New troubleshooting entry for the exact 'unauthorized … 404' error covering both
  causes (wrong host; vhost not reloaded).

Docs only; fences balanced.

* fix(infra): #201 dns-upsert derives worker EIP from broker's A record (not 'first associated EIP')

Root cause of the config (and all-worker) cert failures: dns-upsert-workers.sh
derived the EIP via `describe-addresses | first`, which can't distinguish the
PROD broker EIP from the TEST broker EIP when both are allocated. It silently
grabbed the test EIP (3.214.219.209) and pointed all 5 worker A records at the
test broker, while broker/signer stayed on prod (54.164.117.252). Let's Encrypt
then validated config.litentry.org against the test box (404).

Derive the workers' EIP from BROKER_HOST's OWN Route 53 A record instead — the
workers co-locate with the broker, so their records MUST mirror it. This is
env-aware (BROKER_HOST is broker.${ZONE} for prod vs test-broker.${ZONE} for test)
and authoritative. Add a co-location guard that warns when the chosen/passed EIP
disagrees with the broker's A record (catches a prod/test mixup early).

cloud-bootstrap.md §5b gains a troubleshooting entry for 'worker cert fails but
broker works' with a DoH cross-check loop.

Verified live (--dry-run against the real zone): derives 54.164.117.252 and sets
all 5 worker records to it; bash -n clean.

* fix(infra): #201 dns-upsert derives EIP by broker tag (prod vs CI/test), matching setup-cloud step 4

Prod and the CI/test broker are SEPARATE machines with SEPARATE EIPs. The previous
fix derived from broker.${ZONE}'s A record (works for prod, but chicken-egg on a
fresh test box + a different mechanism than the bootstrap). Switch to the SAME
tag-based, TEST_MODE-aware derivation setup-cloud.sh step 4 uses — one source of
truth:
  prod  → describe-addresses --filters Name=tag:Name,Values=agentkeys-broker-eip
  test  → ...Values=agentkeys-broker-eip-test   (--test, or a *test* ENV_FILE)

- New --test flag + auto-detect from a *test* ENV_FILE (switches to
  operator-workstation.test.env), mirroring setup-cloud.
- Keep the broker-A-record co-location cross-check as a warn-only guard.

Verified live (--dry-run): prod → 54.164.117.252 (tag agentkeys-broker-eip);
--test → 3.214.219.209 (tag agentkeys-broker-eip-test). bash -n clean.
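
The tag-based derivation can be sketched in two small functions. The names `eip_tag_for` and `derive_eip` and the `prod`/`test` mode strings are assumptions; the tag values and the `describe-addresses` filter come from the commit message, and the `--query` expression is one plausible way to extract the address.

```bash
# Hypothetical helper: pick the EIP Name tag for the target environment.
eip_tag_for() {
  case "$1" in
    prod) echo "agentkeys-broker-eip" ;;
    test) echo "agentkeys-broker-eip-test" ;;
    *)    echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Look up the public IP by tag (needs AWS credentials; not executed here).
derive_eip() {
  local tag
  tag=$(eip_tag_for "$1") || return 1
  aws ec2 describe-addresses \
    --filters "Name=tag:Name,Values=$tag" \
    --query 'Addresses[0].PublicIp' --output text
}
```

Filtering by tag means the result no longer depends on the order in which AWS happens to return allocated addresses.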

* docs(CLAUDE): #201 always verify the broker IP env-aware (prod vs CI/test = separate EIPs)

Two broker EC2 instances exist with separate EIPs, distinguished by the EIP Name
tag (agentkeys-broker-eip vs agentkeys-broker-eip-test). 'describe-addresses
first-match' silently picks the wrong one — it pointed all 5 worker A records at
the test broker while broker/signer were on prod (multi-round LE 404s). New AWS-
gotchas subsection: never first-match; derive by the env-aware tag (setup-cloud
step 4 / dns-upsert-workers.sh), curl ifconfig.me on the host, DoH-cross-check
workers == broker for DNS.

* perf(deploy): #201 keep Rust toolchain across broker re-deploys (the real slow-rebuild cause)

setup-broker-host.sh deleted /root/.cargo + /root/.rustup at the END of every run
(~1.5GB reclaim). So every re-deploy re-downloaded the WHOLE rustup toolchain +
all 372 crate sources — minutes of pure waste (target/ persists in the repo dir,
which is why the compile itself was only ~50s, but the toolchain+registry did not).

- KEEP the toolchain by default; gate the delete behind a new --reclaim-toolchain
  flag (pass it on a final/one-shot deploy to free the disk).
- Pre-source $HOME/.cargo/env in the build-prereqs step so a kept toolchain is on
  PATH on a non-login sudo shell — otherwise `have rustup` is false and it
  reinstalls anyway even with /root/.cargo present.
- Header usage + post-run NOTE updated to reflect keep-by-default.

Combined with the sccache change (86d18be), re-deploys now skip toolchain DL +
crate-registry DL + most recompilation. bash -n clean.

* refactor(deploy): #201 hard rule — 3 idempotent entry points + --ci env flag

Per the deploy-script governance: there are exactly THREE idempotent deployment
orchestrators (setup-cloud.sh / setup-broker-host.sh / setup-heima.sh); every
other mutation is wired into one of them. Codify it in CLAUDE.md + standardise the
environment flag.

- Add --ci (canonical CI-env flag; --test retained as alias) to all 3 entry points
  + dns-upsert-workers.sh. Plain run = local/prod; --ci = CI (selects the
  agentkeys-broker-eip-test EIP, -test IAM/buckets, *.test.env).
- CLAUDE.md: new 'Three idempotent deployment entry points' section (ownership
  table, flag convention, HARD wire-in rule, exempt list). Verified mcp-host is
  already wired into setup-broker-host (#152 re-converge); setup-dev-env is a
  dev-workstation bootstrap (exempt, not a deploy).

Verified: bash -n clean; --ci --dry-run derives the test EIP (3.214.219.209).

* fix(infra): #201 test stack — config data class in test env + step-13 ENV_FILE passthrough + cloud-bootstrap --ci

Two real test-mode bugs + doc drift, found while fitting the scripts to cloud-bootstrap.md:

- operator-workstation.test.env was MISSING the entire config (#201) data class
  (CONFIG_ROLE_ARN / CONFIG_BUCKET / WORKER_CONFIG_HOST / *_URL) — so
  setup-cloud.sh --ci / setup-broker-host.sh --ci would die on the WORKER_CONFIG_HOST
  validation. Added the -test trio (agentkeys-config-role-test, agentkeys-config-test-<acct>,
  config-test.litentry.org).
- setup-cloud.sh step 13 called provision/apply-*.sh WITHOUT ENV_FILE; each re-sources
  operator-workstation.env (prod) and overwrites inherited CONFIG_BUCKET, so --ci would
  silently provision PROD buckets. Now passes ENV_FILE through (DRY loop) → -test buckets.
- cloud-bootstrap.md: --test → --ci (alias noted) in quick-start; added config bucket to
  'what --ci derives'; corrected the stale 'toolchain deleted each run' note to the new
  keep-by-default + --reclaim-toolchain behavior; called out prod vs CI = separate EIPs.

Verified: test env config trio resolves; setup-cloud bash -n clean.

* docs(CLAUDE): #201 codify env-file + provisioner discipline for new data classes

This session's two test-mode bugs were systemic, not one-offs — fold them into the
#90 isolation section's data-class checklist so the next data-class-adder can't repeat:
1. a new data class MUST be added to BOTH operator-workstation.env AND .test.env
   (.test.env is not auto-derived; a prod-only key breaks the whole --ci path).
2. setup-cloud.sh delegation MUST pass ENV_FILE to provision/apply helpers (they
   re-source prod env + overwrite inherited $BUCKET, so --ci would hit prod buckets).
Includes the verify step (setup-cloud.sh --ci --dry-run must name -test resources).

* fix(harness): stage-3 Totals line renders colors (escapes in format string, not %s args)

The Totals summary printed literal \033[1;32m… because the C_* color vars (literal
"\033[…" strings) were passed as printf %s ARGS — printf only interprets \033 in
the FORMAT string, not in args. Moved the colors into the format string, matching
the ${C_*}-in-format pattern used everywhere else. TTY-gated defs unchanged, so
non-TTY/CI runs stay plain. Verified via cat -v (^[ = real ESC); bash -n clean.
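
The difference between the two printf forms is easy to reproduce. In this sketch the function names are hypothetical and the color definitions mirror the literal-backslash style the commit message describes.

```bash
C_GREEN='\033[1;32m'   # literal backslash sequence, not a real ESC byte
C_RESET='\033[0m'

# Broken form: escapes passed as %s arguments are printed literally.
totals_as_args() { printf '%s%s%s\n' "$C_GREEN" "$1" "$C_RESET"; }

# Fixed form: escapes sit in the format string, so printf emits a real ESC.
totals_in_format() { printf "${C_GREEN}%s${C_RESET}\n" "$1"; }
```

Piping either function through `cat -v` shows the difference: the fixed form displays `^[` where the broken form displays `\033`.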

* feat(daemon,worker): #201 Phases 4-5 — Config taxonomy memory list + lazy detail + codex hardening

Phase 4 (daemon): read/write the memory-types taxonomy via the Config data
class (--config-url/--config-role-arn); GET /v1/master/memory returns
categories from the taxonomy (no decrypt, cache fallback); new lazy
GET /v1/master/memory/entry?ns=&key=; plant writes per-namespace JSON arrays.
CLI hook memory-inject renders the array (single-body still injects). harness
memory-plant-demo + web-parity write/pass the new shape.

Phase 5 (frontend): apps/parent-control lists categories, decrypts a
namespace's entries on demand; plant re-fetches categories.

Codex adversarial-review hardening:
- finding 1 (data loss): plant is now a read-modify-write merge under a
  plant_lock (durable blob preserved; abort-on-read-error, never overwrite).
- finding 2 (silent failure): memory+config workers return 404 on NoSuchKey;
  list 502s on a configured-but-broken Config; plant returns taxonomy_status.

Workers changed → requires a setup-broker-host.sh redeploy for the 404 behavior.

* fix(ci): #201 stage-3 config steps — emit config env keys + tolerate config-role-missing

harness-e2e crashed at stage-3 step 19 with `CONFIG_ROLE_ARN: unbound variable`:
the CI env-materializer (harness-ci.yml) never emitted the config data-class keys
Phase 3 added to the stage-3 demo, and the demo runs under `set -u`.

- harness-ci.yml: materialize CONFIG_BUCKET / CONFIG_ROLE_ARN /
  AGENTKEYS_WORKER_CONFIG_URL (derived -test values, no new secret); allow the
  config-role-missing skip (operator one-shot, like scope-not-set) so step 19
  skips cleanly until the test config bucket/role are provisioned. Steps 20-21
  (config cap-mismatch) still run against the deployed config worker.
- v2-stage3-demo.sh: default the config vars to empty after sourcing the env
  file → degrade via prereq_missing instead of an unbound-variable abort.
- CLAUDE.md: fold the materializer into the env-file discipline (3rd place a
  new data class's keys must land).
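
The default-to-empty pattern from the v2-stage3-demo.sh bullet can be sketched as follows. This is an assumption-laden illustration: the function name `config_step_status` and the exact message text are hypothetical, and only the three variable names come from the commit message.

```bash
#!/usr/bin/env bash
# Sketch: decide whether the config step can run. The body is a subshell, so
# `set -u` and the defaults below do not leak into the caller.
config_step_status() (
  set -u
  # Default each config var to empty when the env file did not define it.
  # ${VAR:=} never trips `set -u`, unlike a bare $VAR.
  : "${CONFIG_BUCKET:=}" "${CONFIG_ROLE_ARN:=}" "${AGENTKEYS_WORKER_CONFIG_URL:=}"
  if [ -z "$CONFIG_ROLE_ARN" ]; then
    echo "prereq_missing: config-role-missing"   # hypothetical message text
  else
    echo "run"
  fi
)
```

With the defaults in place, a missing key degrades to a reportable skip reason rather than aborting the whole script at the first reference.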

* docs(#207): onboarding/classifier design spec + policy/scope/namespace wiki

Add the product/onboarding view of the classifier design (#178) on top of the landed
Config substrate (#201): the two config-init entry points (default preset +
NL->COMPILE), connect-time classifier auto-distribution of cred + memory scopes (one
pattern, two axes), the four security invariants, and the resolved decisions tracked
in #207 (telemetry split to #208).

- docs/plan/web-flow/onboarding-classifier-distribution.md (new spec)
- docs/wiki/policy-scope-namespace.md (new terminology reference, lint-clean)
- docs/arch.md section 5 canonical-names row (policy/scope/namespace/category/service)
- docs/plan/classifier-service.md cross-links

* ci(harness): allow config-worker-unreachable skip on the test env (#201)

stage-3 step 21 hits the config worker HTTPS endpoint (config-test.<zone>), whose cert
can't issue until the config-test DNS record is provisioned by the operator one-shot
(setup-cloud.sh --ci). This is the SAME one-shot already tolerated via
config-role-missing. Add config-worker-unreachable to the stage-3 allow-skip so CI
skips step 21 cleanly until the test config infra exists; step 20 + the
agentkeys-worker-config unit tests still cover the config cap-data-class-mismatch.
harness/CLAUDE.md already documents steps 19-21 as 'skip until config infra is
provisioned/deployed'. Drop the allowance once config-test is provisioned.
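
An allow-skip check of this kind can be sketched as below. The three skip reasons are the ones named in these commit messages; the variable `ALLOWED_SKIPS`, the function name, and the space-separated list format are assumptions, since the workflow's real mechanism is not shown here.

```bash
#!/usr/bin/env bash
# Skip reasons named in the commit messages; the list format is an assumption.
ALLOWED_SKIPS="scope-not-set config-role-missing config-worker-unreachable"

# Succeeds (exit 0) only when the given skip reason is on the allow-list.
is_allowed_skip() {
  local reason=$1 allowed
  for allowed in $ALLOWED_SKIPS; do
    [ "$reason" = "$allowed" ] && return 0
  done
  return 1
}
```

Dropping an allowance later is then a one-word removal from the list, and any skip reason not on it fails the run.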
hanwencheng added a commit that referenced this pull request Jun 6, 2026
… combine, not just resolve

#205 (issue #201) landed a THIRD data class (Config): /v1/cap/config-{store,fetch}
+ an agentkeys-worker-config worker + a hand-rolled daemon config/per-ns-memory
chain. #204 (#203) made agentkeys-backend-client the ONE owner of the broker/worker
protocol. Rather than let the two coexist as parallel hand-rolled vs crate-owned
chains, this merge folds #205's new surface INTO the #203 single-owner model.

Conflicts resolved (2 files):
- ui_bridge.rs: adopt #205's per-namespace storage model wholesale (memory_put_ns_real
  / memory_get_ns_real / RMW-under-plant-lock / real_config_ctx) — my per-entry
  memory_put_real + real_memory_client are SUPERSEDED, dropped. Kept my route consts
  (MASTER_MEMORY_{,PLANT_}ROUTE) + the plant-contract unit test, and #205's new
  /v1/master/memory/entry route. Swapped #205's inline 0x-normalize in the shared
  resolve_session_coords for the crate's normalize_omni_0x.
- memory-plant-demo.sh: keep #205's per-ns JSON-array blob + my @backend-fixture
  annotation.

Combine (#203 applied to #205's surface):
- crate: CapMintOp gains ConfigStore/ConfigFetch (6 cap endpoints now); add
  ConfigPutBody/ConfigGetBody + fixtures (regenerated, now 6).
- daemon mint_master_cap → BackendClient::cap_mint (the cap-mint body — the #200
  drift locus — is now the crate's BrokerCapRequest for memory AND config; one
  function covers all 4 routes). Worker put/get bodies (memory + config) build from
  the crate's MemoryPutBody/MemoryGetBody/ConfigPutBody/ConfigGetBody types; the raw
  POST stays in the daemon to reuse the once-minted STS creds across namespaces.
  Re-added agentkeys-provisioner to the daemon (still used for that STS mint).
- gate: config_put/config_get fixtures are pass-1-annotatable but EXCLUDED from
  pass-2 auto-detect (key-set-identical to cred bodies → would false-positive);
  documented in the gate + the fixtures README. #205's bash bodies (4-key ttl-omitted
  cap + ambiguous cred/config worker bodies) don't trip pass-2.
- docs: arch.md tree gains agentkeys-worker-config + updated backend-client note;
  root CLAUDE.md #203 rule updated for the 6 endpoints + config body types.

Verified: cargo build + clippy -D warnings + cargo test --workspace all clean (0
failures; plant-contract + config frozen tests pass); backend + web-api drift gates
+ fixture --check pass under LC_ALL=C.UTF-8; bash -n clean on all touched scripts.
hanwencheng added a commit that referenced this pull request Jun 6, 2026
…6 (web-parity)

harness-ci.yml ran v2-stage{1,2,3}-demo.sh in isolation — it predated the #200
v2-demo restructure and never picked up phase 4 (memory-plant) or phase 6
(web-parity). Phase 6 is the ONLY runtime proof of the daemon's web endpoint
(POST /v1/master/memory/plant → cap-mint → STS → worker → S3, the parent-control
app's path); stage 3 only exercises the CLI/curl path. The #203
check-web-api-drift.sh gate covers its SHAPE at compile/fixture time, but nothing
covered its runtime reachability in CI.

Switch the harness-e2e job to the whole orchestrator: `v2-demo.sh --ci` → phases
1-4 + 6. Phase 5/wire auto-skips — the §10.2 agent needs the aiosandbox, which CI
doesn't have, so --ci sets --wire none (the one phase CI genuinely can't run).
Running phases in sequence also means phase 1 registers the master that phase 6
reuses.

Enabler: v2-stage1-demo.sh now auto-skips deploy/email/provision under --ci/$CI
(CI runs against pre-provisioned infra — contracts pinned in TEST_*_HEIMA secrets,
identity via wallet_sig, vault/memory buckets+roles an operator one-shot the CI
role can't recreate). Mirrors stage-1's existing auto-WEBAUTHN-off + stage-2's
auto-stub under --ci, so `v2-demo.sh --ci` drives stage 1 without re-passing the
three skip flags. The build step now builds what v2-demo's preflight expects
(cli + daemon + mcp-server; mock-server is mock-mode-only and unused in real CI).

Docs: harness-ci.yml header + harness/CLAUDE.md CI-role note + the operator
runbook's On-CI semantics. (The runbook already documented `v2-demo.sh --ci` as
the CI front door — this makes the workflow match it.)

NOTE: the harness-e2e job is secret-gated (TEST_OIDC_AWS_ROLE_ARN) and can't run
locally — validated by YAML lint + bash -n + flag-threading review + the drift
gates; needs a CI run with the test secrets to confirm end-to-end.
hanwencheng added a commit that referenced this pull request Jun 6, 2026
…rker chain (#204)

* refactor: #203 agentkeys-backend-client — ONE owner for the broker/worker chain

The broker/worker HTTP chain was hand-coded in three places (MCP HttpBackend,
daemon ui_bridge, harness bash), the structural cause of the #200 drift bugs
(evm_address vs {address,chain_id}, bare-vs-0x omni, per-namespace field
shapes). Collapse it behind one crate so drift is a COMPILE error (Rust callers
share the types) or a FIXTURE mismatch (the harness gate), not a runtime 4xx.

New crate agentkeys-backend-client (the dual of broker-server / worker-*):
- protocol.rs: every cap-mint / worker / audit wire shape, the memory:<ns>
  service builder, and the 0x-omni normalizer (the daemon's old inline bug site)
- client.rs: BackendClient — cap-mint (4 data-class endpoints) -> STS relay ->
  worker put/get -> audit append (the reference impl lifted out of HttpBackend)
- fixtures.rs + dump-protocol-fixtures bin: canonical fixtures serialized from
  the serde types + frozen key-set pins

Collapse the duplicates (net -355 LOC in existing files):
- MCP HttpBackend -> thin delegate over BackendClient; backend wire-shape
  submodules (broker/memory/audit) deleted, re-exported from the crate so the
  Backend trait + InMemoryBackend + tools keep their crate::backend::* paths
- daemon memory_put_real / real_memory_ctx -> call the shared client (kills the
  duplicate cap-mint body + the inline 0x-normalize where the bugs lived)

Enforce (fold-systemic-fixes-into-enforcement):
- scripts/check-backend-fixture-drift.sh: diffs every # @backend-fixture-
  annotated bash body against the crate-emitted fixtures (catches add/rename/drop)
- harness-ci.yml rust-checks runs the fixture --check + the bash gate on every PR
  touching crates/**, harness/**, scripts/**
- root CLAUDE.md + harness/CLAUDE.md "broker/worker shapes have ONE owner" rule;
  arch.md component inventory updated

* refactor: #203 tier-2 — close phase-6's frontend false-green (the #206 parity ladder)

#204 made the broker/worker chain tier-3 (compile-enforced). The adjacent
blind spot the #206 ladder names is the daemon's web API: the route
/v1/master/memory/plant + the ApiMemoryEntry body are hand-copied in 3 places —
the daemon (Rust source of truth), the React frontend daemon.ts, and the harness
web-parity-demo.sh — agreeing only by manual coincidence. A daemon.ts route/shape
change left phase 6 green on the old path (false-green).

Pin all three to one serde source of truth (rung 2 of the ladder):
- daemon: MASTER_MEMORY_{,PLANT_}ROUTE consts (used by the router) + a ui_bridge
  unit test (master_memory_plant_contract_matches_fixture) pinning ApiMemoryEntry's
  keys + the route to harness/fixtures/web-api/master_memory_plant.json
- gate scripts/check-web-api-drift.sh diffs the two NON-Rust consumers (daemon.ts +
  web-parity-demo.sh, both carrying a `@web-fixture: master_memory_plant` annotation)
  against that fixture — route + entry key-set. Wired into harness-ci rust-checks.
- a daemon.ts route rename or entry field add/rename/drop is now CI-red, not a
  stale green (negative-tested both halves).

Docs: update the #206 ladder section in harness/CLAUDE.md (false-green now CLOSED;
plant contract is at rung 2; rung-3 endgame = agentkeys-web-core wasm so daemon.ts
stops hand-building the body); add the web-api gate to the root CLAUDE.md #203 rule.

* fix(harness-ci): brace $SCAN_DIR so the fixture gate survives set -u in a UTF-8 locale

scripts/check-backend-fixture-drift.sh interpolated `$SCAN_DIR…` — the variable
immediately followed by a Unicode ellipsis (U+2026, E2 80 A6). Under `set -u` in a
UTF-8 locale (C.UTF-8 / en_US.UTF-8 — what GitHub ubuntu-latest uses), bash's
multibyte identifier scan absorbs the ellipsis into the name, reads `SCAN_DIR…` as
an unbound variable, and aborts before checking any fixture. The new harness-ci
`rust-checks` step (`bash scripts/check-backend-fixture-drift.sh`) would then fail
on EVERY PR regardless of protocol correctness, and the drift protection never ran.

Reproduced locally: `set -u; V=/tmp; echo "$V…"` exits 1 (`V: unbound variable`)
under LC_ALL=C.UTF-8/en_US.UTF-8 but exits 0 under LC_ALL=C; the braced form
`${V}...` exits 0 under all three. Both gates now pass under LC_ALL=C.UTF-8 +
en_US.UTF-8.

Fix: brace the var (`${SCAN_DIR}`, the CLAUDE.md interpolation-defense convention)
and use ASCII `...` so no following byte can extend the name. Also switched the one
other executable ellipsis log line in check-web-api-drift.sh to ASCII for the same
robustness. Repo-wide scan confirms no other `$VAR<multibyte>` adjacency in
scripts/ or harness/. (Codex adversarial-review finding.)
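
The fixed form can be sketched as follows. The function name `scan_banner` and the message text are illustrative; only the braced `${SCAN_DIR}` and the ASCII `...` reflect the fix described above.

```bash
#!/usr/bin/env bash
# Sketch of the braced, ASCII-only log line. The body is a subshell so that
# `set -u` stays local to this function.
scan_banner() (
  set -u
  SCAN_DIR=$1
  # ${SCAN_DIR} ends the variable name at the closing brace, and "..." is
  # plain ASCII, so no following byte can be absorbed into the name.
  echo "scanning ${SCAN_DIR}..."
)
```

Bracing is cheap insurance: the name boundary no longer depends on how the shell classifies whatever character happens to follow the variable.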

* fix(harness-ci): tighten both drift gates — call-site route check + unannotated-canonical guard

Two Codex adversarial-review findings, both a residual false-green:

1. Route check passed on stale literals (check-web-api-drift.sh). The web-api
   gate `grep`ed the whole consumer file for the canonical route, so the route
   appearing in a step label / comment satisfied it even if the actual POST URL
   changed — the exact false-green the gate exists to close. Now assert the CALL
   SITE: the route must appear immediately followed by a closing quote (it
   terminates a URL/string literal) within a few lines of a `curl`/`-X POST`
   (bash) or `postJson`/`fetch` (TS) call. A stale label (route followed by a
   space/arrow) no longer satisfies it; a drifted prefix like `…/plantX"` is
   rejected because the char after `plant` is `X`, not a quote. Verified: changing
   the real curl URL while leaving the step label stale now fails.

2. Fixture gate missed an unannotated canonical body (memory-plant-demo.sh:154).
   The `/v1/memory/get` read-back hand-rolled `{cap, namespace}` with no
   `@backend-fixture` annotation, so pass 1 (annotated-only) never gated it.
   Fix both ways: (a) annotate that body; (b) add pass 2 to
   check-backend-fixture-drift.sh — scan EVERY single-quoted jq object literal
   and fail any whose key-set EXACTLY matches a canonical fixture but lacks an
   annotation. Exact-match is false-positive-free: the v2-stage3 cred bodies
   (`{cap, plaintext_b64}`, `{cap}`) and the ttl-omitted 4-key cap variant
   (broker `CapRequest.ttl_seconds` is `#[serde(default)]`) match no canonical set
   and are left alone. Verified: removing the annotation now fails pass 2;
   re-adding passes; no other unannotated canonical bodies exist in the harness.

Both gates pass under LC_ALL=C.UTF-8 + en_US.UTF-8; bash -n clean. (Codex
adversarial-review findings.)
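
Both tightened checks can be sketched in a few lines. This is a simplified illustration rather than the gates' actual code: the route check here requires the call on the same line (the real gate looks within a few lines), the key-set check takes space-separated key names instead of parsing jq literals, and all function names plus the `CANONICAL_KEYSETS` table are hypothetical.

```bash
#!/usr/bin/env bash
# Check 1 (simplified to one line): the route must END a quoted string on a
# line that also performs the call; a bare mention in a label is not enough.
route_at_call_site() {
  local file=$1 route=$2
  grep -F -e "${route}\"" -e "${route}'" -- "$file" \
    | grep -Eq -e 'curl|-X POST'
}

# Check 2: normalize a space-separated key list so key order is irrelevant.
normalize_keys() {
  printf '%s\n' $1 | sort | tr '\n' ' '
}

# Illustrative table: only the {cap, namespace} body is taken from the text
# above; a real gate would derive the sets from the emitted fixtures.
CANONICAL_KEYSETS=("cap namespace")

# Succeeds when the key list EXACTLY matches a canonical key-set.
matches_canonical() {
  local want canon
  want=$(normalize_keys "$1")
  for canon in "${CANONICAL_KEYSETS[@]}"; do
    [ "$want" = "$(normalize_keys "$canon")" ] && return 0
  done
  return 1
}
```

Requiring the closing quote right after the route is what rejects both a stale label and a drifted `plantX` URL, while exact key-set equality keeps the `{cap, plaintext_b64}` and `{cap}` bodies out of the unannotated-body check.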

* ci(harness): run the whole v2-demo (--ci) so harness-CI covers phase 6 (web-parity)

harness-ci.yml ran v2-stage{1,2,3}-demo.sh in isolation — it predated the #200
v2-demo restructure and never picked up phase 4 (memory-plant) or phase 6
(web-parity). Phase 6 is the ONLY runtime proof of the daemon's web endpoint
(POST /v1/master/memory/plant → cap-mint → STS → worker → S3, the parent-control
app's path); stage 3 only exercises the CLI/curl path. The #203
check-web-api-drift.sh gate covers its SHAPE at compile/fixture time, but nothing
covered its runtime reachability in CI.

Switch the harness-e2e job to the whole orchestrator: `v2-demo.sh --ci` → phases
1-4 + 6. Phase 5/wire auto-skips — the §10.2 agent needs the aiosandbox, which CI
doesn't have, so --ci sets --wire none (the one phase CI genuinely can't run).
Running phases in sequence also means phase 1 registers the master that phase 6
reuses.

Enabler: v2-stage1-demo.sh now auto-skips deploy/email/provision under --ci/$CI
(CI runs against pre-provisioned infra — contracts pinned in TEST_*_HEIMA secrets,
identity via wallet_sig, vault/memory buckets+roles an operator one-shot the CI
role can't recreate). Mirrors stage-1's existing auto-WEBAUTHN-off + stage-2's
auto-stub under --ci, so `v2-demo.sh --ci` drives stage 1 without re-passing the
three skip flags. The build step now builds what v2-demo's preflight expects
(cli + daemon + mcp-server; mock-server is mock-mode-only and unused in real CI).

Docs: harness-ci.yml header + harness/CLAUDE.md CI-role note + the operator
runbook's On-CI semantics. (The runbook already documented `v2-demo.sh --ci` as
the CI front door — this makes the workflow match it.)

NOTE: the harness-e2e job is secret-gated (TEST_OIDC_AWS_ROLE_ARN) and can't run
locally — validated by YAML lint + bash -n + flag-threading review + the drift
gates; needs a CI run with the test secrets to confirm end-to-end.

* fix(harness): v2-demo preflight must put target/release on PATH (CI phase-1 fix)

The harness-CI switch to `v2-demo.sh --ci` failed at phase 1:
`v2-stage1-demo.sh: line 360: agentkeys: command not found`. The stages call a
BARE `agentkeys` (resolved from PATH). In the old per-stage CI, each stage ran its
own build step (no --skip-build), which installs agentkeys onto PATH via
install-agentkeys-cli.sh. Under v2-demo the preflight builds target/release ONCE
and tells the stages to --skip-build — so they skip the install, and the preflight
never exposed the built binary on PATH. CI has no globally-installed agentkeys, so
the bare call died, cascading: phase 1 died at step 5 (before its register), so the
master was never registered → phase 2 register_first_master also couldn't find
agentkeys-cli → heima-worker-smoke failed.

Fix: the preflight now `export PATH="$PROJECT_ROOT/target/release:$PATH"` right
after the build — so every phase subprocess resolves the just-built agentkeys /
agentkeys-daemon (prepended, so it wins over any stale global install). This is the
missing piece of the preflight's "build once, phases reuse" contract; it helps
operators too (they get the build they just made, not a stale install). Verified
locally: a bare `agentkeys chain show heima` resolves + runs under the exported
PATH (the exact line-360 pattern).
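
The effect of the prepend can be sketched as a small helper. The function name `expose_release_bins` is hypothetical, and the project root is passed as an argument here, whereas the real preflight uses its own `$PROJECT_ROOT`.

```bash
#!/usr/bin/env bash
# Sketch: prepend the just-built binaries so they win over any global install.
expose_release_bins() {
  local project_root=$1
  export PATH="$project_root/target/release:$PATH"
  hash -r   # drop bash's cached command locations so the new PATH is used
}
```

Because the variable is exported, every phase subprocess inherits it, and a bare `agentkeys` resolves to the fresh build first.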

Also (cleanup completeness): the harness-e2e S3 cleanup now also wipes the CONFIG
bucket's bots/<omni>/config/ (phase 6 writes the #201 memory-taxonomy there when
config infra is present). Guarded by [ -n "$CONFIG_BUCKET" ], so it's a no-op
until TEST_CONFIG_BUCKET is set. Memory + creds (vault/memory buckets) were already
wiped; this closes the config-class gap.
hanwencheng added a commit that referenced this pull request Jun 7, 2026
Ties the two existing halves into one ready-to-sign PackedUserOperation:
- intent: agentkeys_core::erc4337::accept_batch_calldata (the atomic
  executeBatch([registerAgentDevice, setScope]), P.2+P.3)
- sponsorship: the broker EIP-191 co-signs the VerifyingPaymaster getHash
  (J1-gated Sybil gate = gas-free), via crate::sponsor (#200 Stage A).

New crates/agentkeys-broker-server/src/sponsored_accept.rs:
- AcceptUserOpParams — every chain-derived value (nonce/gas/fees/validity/addrs)
  is an explicit input (nothing hardcoded; caller reads them on-chain).
- assemble_accept_userop(params, broker_sk) -> AssembledAcceptUserOp { user_op,
  user_op_hash, paymaster_get_hash }. Sets paymasterAndData[20:52] (the gas word)
  provisionally so paymaster_get_hash commits the limits the broker approves, then
  rebuilds paymasterAndData with the real co-sign appended; computes the userOpHash
  the master K11 signs. Pure (broker key only, no chain I/O).
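
The byte layout referred to above can be made concrete with a hex-slicing sketch. It assumes the common ERC-4337 packed layout in which the first 20 bytes are the paymaster address, followed by the 32-byte gas word at `[20:52]`; the function name `pmd_field` and its field labels are illustrative and not part of the broker crate.

```bash
#!/usr/bin/env bash
# Sketch: slice a hex-encoded paymasterAndData into its regions.
# Byte offsets double when indexing hex characters (2 hex chars per byte).
pmd_field() {
  local hex=${1#0x} field=$2
  case "$field" in
    paymaster) printf '%s\n' "${hex:0:40}" ;;    # bytes [0:20]
    gas_word)  printf '%s\n' "${hex:40:64}" ;;   # bytes [20:52]
    data)      printf '%s\n' "${hex:104}" ;;     # bytes [52:], co-sign appended here
    *) return 1 ;;
  esac
}
```

Fixing the gas word before hashing is what lets the paymaster hash commit to the limits the broker approved, since the co-sign itself only lands in the trailing region.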

Broker-side because the paymaster co-sign needs the broker key; the daemon will
call this via an endpoint and just K11-sign the returned userOpHash (the #200
division of labour). 3 unit tests: callData==accept batch + sender==master +
empty account sig + deterministic hash; paymasterAndData layout + broker co-sign
recovers to the broker EOA; grant change => userOpHash change. cargo test + clippy green.

Slice 2 of #225. Next: the broker HTTP endpoint wrapping this + the daemon call +
the Stage-B handleOps submit (cast-based, mirrors the E8 proof). Refs #225.
hanwencheng added a commit that referenced this pull request Jun 8, 2026
…point-of-compromise (#223)

* feat: #76 cap-mint K10 proof-of-possession — close the broker single-point-of-compromise

Every cap-mint now carries a K10 device-key signature the worker re-verifies
INDEPENDENTLY of the broker, so a compromised broker (which holds no K10 private
key) cannot mint a usable cap. Closes the §22b.4 stage-1 gap, where the worker
re-checked only the broker's own broker_sig + on-chain device *registration*
(device_key_hash is a public identifier), never possession.

- core: device_crypto::cap_pop_payload (domain-separated, request-bound) +
  cap_pop_now/cap_pop_sig + load_device_key_from_env
- broker: handlers/cap.rs::verify_cap_pop rejects forged/missing/stale client_sig
  (cap_pop_invalid 4xx); §22b.4 shortcut removed
- workers (cred/memory/config/classify): verify::check_client_pop, fail-closed,
  gated by AGENTKEYS_WORKER_REQUIRE_CAP_POP (default enforce, mirrors REQUIRE_STS)
- clients: BackendClient::with_device_key signs the PoP inside cap_mint (MCP,
  daemon ui-bridge, proxy); BrokerCapRequest fields + #203 fixtures regenerated
- master K10/K11 split: harness/scripts/heima-register-master-k10.sh registers the
  master's secp256k1 K10 as a CAP_MINT device (registerAdditionalMasterDevice,
  reusing the #200/#164 K11-assertion machinery), wired into setup-heima step 15
- docs: arch.md §22b.4 resolved + headline guarantee; CLAUDE.md isolation table

Agent path verified: 50 Rust test suites green, clippy clean, backend-fixture gate
green. NEEDS-LIVE-VERIFICATION (no chain / Touch ID here): the on-chain master-K10
registration (cast/ABI + EOA-vs-#164-UserOp msg.sender) and cast EIP-191 matching
device_crypto::eip191_sign in the 2 master-path harness demos.

* style: cargo fmt the #76 K10 PoP code (CI fmt --check)

* fix: #76 make cap-mint K10 PoP optional + graceful (staged rollout)

The harness on test infra caught that hard-requiring the K10 cap-PoP broke the
master-self path: the master registers device_key_hash=keccak(operator_omni)
(the #164 passkey account) and has no secp256k1 K10 registered yet, so master
cap-mints (phase 4 memory-plant, phase 6 web-parity) failed 'master K10 not found'.

Make the PoP optional + verify-when-present (the correct non-breaking staged rollout):
- protocol/broker/worker: client_sig/nonce/ts are Option; a supplied PoP is always
  validated (broker verify_cap_pop + worker), a MISSING PoP is rejected ONLY under
  AGENTKEYS_WORKER_REQUIRE_CAP_POP=1 (default OFF). New verify::enforce_client_pop
  centralizes the gate across the 4 workers.
- clients (BackendClient/ui_bridge/proxy): sign when a K10 is available, else mint
  with no PoP + the caller's device_key_hash — no hard-fail.
- harness master demos: revert to no-PoP bodies (master mints without PoP until its
  K10 is registered); fixture cap_mint_request back to the minimal no-PoP key-set.
- docs (arch §22b.4 + headline, CLAUDE.md): enforcement is a staged flag-flip after
  every actor's K10 (incl. the master's) is registered — that's when the SPOF closes.

The agent path still carries a verified PoP (agents register keccak(K10 addr)).
fmt + clippy + full test suite + fixture gate green locally.
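
The verify-when-present rule reduces to a small decision table, sketched here in shell. The real gate is the Rust `verify::enforce_client_pop`; this function, its name, and its `missing | valid | invalid` input are hypothetical stand-ins for the outcome of the actual signature check.

```bash
#!/usr/bin/env bash
# Sketch of the staged-rollout gate. The argument stands in for the result of
# the real PoP verification: missing | valid | invalid.
pop_gate_decision() {
  local pop_state=$1
  case "$pop_state" in
    valid)   echo accept ;;    # a supplied PoP that verifies
    invalid) echo reject ;;    # a supplied PoP is ALWAYS validated
    missing)
      # A missing PoP is rejected only when enforcement is switched on.
      if [ "${AGENTKEYS_WORKER_REQUIRE_CAP_POP:-0}" = "1" ]; then
        echo reject
      else
        echo accept
      fi ;;
    *) return 2 ;;
  esac
}
```

Flipping the flag changes only the `missing` row, which is why enforcement can be turned on after every actor's K10 is registered without touching the other paths.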

* fix(harness): tolerate email-inbox 5xx (502/503), not just 500, in worker-smoke

The funded harness run cleared the gas failures; the only remaining red was
phase 1 step 15 (worker-smoke email-inbox) returning HTTP 502 — the SAME known
's3:ListBucket IAM not wired on the broker EC2' follow-up the soft-warn already
tolerates as 500, but surfacing via nginx (502) when the worker errors on
ListObjects. The toleration only matched 500, so the 502 variant fell through to
die. Broaden to the 5xx class (500|502|503). Not a #76/code issue — the email
worker /healthz passes; inbox LIST IAM is a separate deploy follow-up.
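
The broadened toleration can be sketched with a `case` statement. The function name and the `ok` / `soft-warn` / `die` labels are illustrative, and treating any 2xx as success is an assumption; only the `500|502|503` class comes from the commit message.

```bash
#!/usr/bin/env bash
# Sketch: classify the email-inbox HTTP status for the worker-smoke step.
classify_inbox_status() {
  case "$1" in
    2??)         echo ok ;;          # assumed success class
    500|502|503) echo soft-warn ;;   # known IAM follow-up, tolerated
    *)           echo die ;;         # anything else still fails the step
  esac
}
```

Listing the statuses explicitly keeps the toleration narrow: a 504 or a 4xx still fails the step.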

All code gates + harness phases 2-6 (incl. the #76 cap-PoP path: phase 3 negatives,
phase 4 plant, phase 6 web-parity) already pass on the funded run.