feat(skill): spawn-cloud-swarm skill + connect-claude installer (PR 7b relay-side)#866
feat(skill): spawn-cloud-swarm skill + connect-claude installer (PR 7b relay-side)#866kjgbot wants to merge 4 commits into
Conversation
Adds the spawn-cloud-swarm Claude skill that orchestrates the
local-mount lifecycle around N cloud.agent.spawn calls. This is the
relay-side half of PR 7b (parent spec
cloud/specs/mcp-cloud-spawn/pr-07b-spawn-cloud-swarm-skill-and-mount-tools.md);
the sibling relaycast PR (feat/cloud-local-mount-tools) must merge
first because the skill references cloud.local-mount.{ensure,status,stop}.
Files:
- skills/spawn-cloud-swarm/SKILL.md: procedure document
- prpm.json: register the skill so prpm publishes it
Still to land in follow-up commits before opening the PR:
- src/cli/lib/install-skill.ts + test (file-drop helper into ~/.claude/skills)
- src/cli/lib/mcp-preflight.ts + test (fail-closed if relaycast MCP lacks the tools)
- src/cli/commands/cloud.ts hook to call installSkill in `cloud connect claude`
- src/cli/commands/cloud.test.ts coverage of the wiring
- test/skills/spawn-cloud-swarm.test.ts procedural test against mocked MCP
Ships the CLI-side wiring for the spawn-cloud-swarm skill (PR 7b
relay-side slice):
- src/cli/lib/install-skill.ts + tests: idempotent fs primitive that
drops bundled SKILL.md at ~/.claude/skills/<name>/SKILL.md (mode
0644, parent dir 0755). Skips writes when the dest hash already
matches; overwrites on hash diff.
- src/cli/lib/mcp-preflight.ts + tests: fails closed with verbatim
remediation when the local relaycast MCP omits any of
cloud.local-mount.{ensure,status,stop}.
- src/cli/commands/cloud.ts: after a successful connectProvider() with
normalized provider 'anthropic', drops the skill via installSkill().
Install failure logs a warning and does not fail the connect flow.
Extends the cloud_auth telemetry event with skill_installed: bool.
- src/cli/commands/cloud.test.ts: covers the claude-branch wiring, the
no-op for other providers, the skip on connect failure, the
resilience to skill-install errors, and the --help description.
- test/skills/spawn-cloud-swarm.test.ts: drives a faithful
programmatic re-implementation of the SKILL.md procedure through a
mocked MCP harness, asserting the call sequence (ensure → spawn×N →
poll status+list → stop only on opt-in), the QUOTA_EXCEEDED 10s × 3
backoff, the persist-by-default teardown, and that SKILL.md
references every required tool by exact name.
- package.json: adds skills/ to the npm files array so the bundled
SKILL.md ships with the published package.
- CHANGELOG.md + README.md: one-line entries documenting the skill
and the install hook.
Cross-repo merge order: this slice depends on the sibling relaycast
PR feat/cloud-local-mount-tools landing first.
Closes Finding 1 from the final fresh-eyes review. The MCP preflight helper (src/cli/lib/mcp-preflight.ts) cannot be wired into `cloud connect` itself because the relaycast MCP runs inside the user's Claude Code / Codex runtime, not inside the agent-relay CLI process. Instead, this commit moves the preflight into the skill body — where an MCP client actually exists — by: - Adding Prereq #4 to skills/spawn-cloud-swarm/SKILL.md that requires `cloud.local-mount.{ensure,status,stop}` to be present on the MCP surface, and quotes the verbatim remediation string. - Adding an MCP_LOCAL_MOUNT_TOOLS_MISSING row to the Error Handling table with the same verbatim remediation. - Asserting in test/skills/spawn-cloud-swarm.test.ts that the SKILL.md body contains the verbatim MCP_PREFLIGHT_REMEDIATION constant and every entry of REQUIRED_CLOUD_LOCAL_MOUNT_TOOLS exported from mcp-preflight.ts, so the two surfaces cannot drift. The pure helper in mcp-preflight.ts remains the source of truth for the required-tool tuple and the remediation string; the SKILL.md references the same strings verbatim and the new static-check test binds them. Tests: 241 passed | 5 skipped across src/cli/ + test/skills/; tsc --noEmit exit 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final iteration of the spawn-cloud-swarm skill install flow: - agent-relay cloud connect claude installs the skill into ~/.claude/skills/spawn-cloud-swarm/ - SKILL.md captures the procedure: cloud.status preflight, ensure local relayfile mount, pick/create channel, spawn N agents, subscribe channel, on stop kill agents and (by default) leave mount running Salvaged manually after ricky workflow runtime hung before the github-primitive shipping step.
📝 WalkthroughWalkthroughThis PR introduces the ChangesCloud Swarm Skill Feature
Sequence Diagram(s)No sequence diagrams generated. While the PR introduces new workflows, the interactions are best understood through the specification document and test harness mocks rather than a single high-level sequence diagram. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The PR spans multiple files implementing a cohesive skill feature with clear logical checkpoints (specification → validation → installation → integration → testing). The changes are heterogeneous across new modules, CLI modifications, and comprehensive test suites. Logic density is moderate, with the most complex elements being idempotent installation with hash comparison and backoff retry semantics in tests. The specification document requires careful review to ensure procedural correctness. Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
| const fileStat = await fs.stat(result.destPath); | ||
| expect(fileStat.mode & 0o777).toBe(0o644); | ||
|
|
||
| const written = await fs.readFile(result.destPath, 'utf-8'); |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cli/commands/cloud.ts`:
- Around line 362-365: The catch block that logs install failure for the
"spawn-cloud-swarm" skill should not assume the thrown value is an Error; update
the deps.log call inside the catch(skillErr) to derive the message safely (e.g.,
use skillErr instanceof Error ? skillErr.message : String(skillErr)) so
non-Error throwables (null, string, etc.) won't cause a secondary exception;
adjust the log line that currently casts to Error and reference the same context
message used today.
In `@test/skills/spawn-cloud-swarm.test.ts`:
- Around line 141-146: The polling loop exits too early because it breaks when
either the mount is stopped OR when no agents are running; change the
termination to require both conditions before breaking. In the loop that calls
harness.status() and harness.list() (variables mountStatus and agents, loop
using opts.pollIterations), replace the two separate if-break checks with a
single combined check that breaks only when mountStatus.running is false AND
agents.every((a) => a.status !== 'running') is true so the poll continues until
both the mount is not running and no agents remain running.
- Around line 125-139: The test uses a single shared backoffWaits counter
causing earlier calls to harness.spawn to consume retries for all workers; move
the retry budget to be per-worker by declaring and initializing backoffWaits
inside the worker loop (e.g., inside the for loop body) or otherwise reset it
before each spawn attempt, and keep the same catch logic that checks code ===
'QUOTA_EXCEEDED' and calls await opts.sleep(10_000) up to three times per
worker.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f3f3b482-046e-475e-a69a-6306a6930468
📒 Files selected for processing (12)
CHANGELOG.mdREADME.mdpackage.jsonprpm.jsonskills/spawn-cloud-swarm/SKILL.mdsrc/cli/commands/cloud.test.tssrc/cli/commands/cloud.tssrc/cli/lib/install-skill.test.tssrc/cli/lib/install-skill.tssrc/cli/lib/mcp-preflight.test.tssrc/cli/lib/mcp-preflight.tstest/skills/spawn-cloud-swarm.test.ts
| } catch (skillErr) { | ||
| deps.log( | ||
| `warning: failed to install spawn-cloud-swarm skill (${(skillErr as Error).message ?? 'unknown error'}); connect succeeded.` | ||
| ); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Read-only verification: locate unsafe casted .message access in catch paths.
rg -n -C2 '\(.* as Error\)\.message' src/cli/commands/cloud.tsRepository: AgentWorkforce/relay
Length of output: 303
🏁 Script executed:
#!/bin/bash
# Search for other catch blocks with unsafe casts in the same file
rg -n 'catch|instanceof|as Error' src/cli/commands/cloud.ts | head -30Repository: AgentWorkforce/relay
Length of output: 610
🏁 Script executed:
#!/bin/bash
# Check the context around line 677 to confirm the safe pattern is used similarly
sed -n '673,680p' src/cli/commands/cloud.tsRepository: AgentWorkforce/relay
Length of output: 403
Guard the warning path against non-Error throwables.
Line 364 dereferences .message from a cast value. If installSkill throws null or a string, this catch block can throw and break the fail-open behavior. The codebase already uses the safe pattern elsewhere (line 677): err instanceof Error ? err.message : String(err).
💡 Suggested fix
} catch (skillErr) {
+ const skillMessage =
+ skillErr instanceof Error ? skillErr.message : String(skillErr ?? 'unknown error');
deps.log(
- `warning: failed to install spawn-cloud-swarm skill (${(skillErr as Error).message ?? 'unknown error'}); connect succeeded.`
+ `warning: failed to install spawn-cloud-swarm skill (${skillMessage}); connect succeeded.`
);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (skillErr) { | |
| deps.log( | |
| `warning: failed to install spawn-cloud-swarm skill (${(skillErr as Error).message ?? 'unknown error'}); connect succeeded.` | |
| ); | |
| } catch (skillErr) { | |
| const skillMessage = | |
| skillErr instanceof Error ? skillErr.message : String(skillErr ?? 'unknown error'); | |
| deps.log( | |
| `warning: failed to install spawn-cloud-swarm skill (${skillMessage}); connect succeeded.` | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cli/commands/cloud.ts` around lines 362 - 365, The catch block that logs
install failure for the "spawn-cloud-swarm" skill should not assume the thrown
value is an Error; update the deps.log call inside the catch(skillErr) to derive
the message safely (e.g., use skillErr instanceof Error ? skillErr.message :
String(skillErr)) so non-Error throwables (null, string, etc.) won't cause a
secondary exception; adjust the log line that currently casts to Error and
reference the same context message used today.
| let backoffWaits = 0; | ||
| for (let i = 0; i < opts.workers; ) { | ||
| try { | ||
| await harness.spawn({ workspaceId: 'ws-1' }); | ||
| i += 1; | ||
| } catch (err) { | ||
| const code = (err as Error & { code?: string }).code; | ||
| if (code === 'QUOTA_EXCEEDED' && backoffWaits < 3) { | ||
| backoffWaits += 1; | ||
| await opts.sleep(10_000); | ||
| continue; | ||
| } | ||
| throw err; | ||
| } | ||
| } |
There was a problem hiding this comment.
Reset quota retry budget per worker, not globally.
At Line 125, backoffWaits is shared across all workers. That means earlier workers can consume retries and later workers may fail before getting their own 3 backoff cycles.
🔧 Proposed fix
- let backoffWaits = 0;
- for (let i = 0; i < opts.workers; ) {
- try {
- await harness.spawn({ workspaceId: 'ws-1' });
- i += 1;
- } catch (err) {
- const code = (err as Error & { code?: string }).code;
- if (code === 'QUOTA_EXCEEDED' && backoffWaits < 3) {
- backoffWaits += 1;
- await opts.sleep(10_000);
- continue;
- }
- throw err;
- }
- }
+ for (let i = 0; i < opts.workers; i += 1) {
+ let backoffWaits = 0;
+ // one retry budget per worker spawn
+ // eslint-disable-next-line no-constant-condition
+ while (true) {
+ try {
+ await harness.spawn({ workspaceId: 'ws-1' });
+ break;
+ } catch (err) {
+ const code = (err as Error & { code?: string }).code;
+ if (code === 'QUOTA_EXCEEDED' && backoffWaits < 3) {
+ backoffWaits += 1;
+ await opts.sleep(10_000);
+ continue;
+ }
+ throw err;
+ }
+ }
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/skills/spawn-cloud-swarm.test.ts` around lines 125 - 139, The test uses
a single shared backoffWaits counter causing earlier calls to harness.spawn to
consume retries for all workers; move the retry budget to be per-worker by
declaring and initializing backoffWaits inside the worker loop (e.g., inside the
for loop body) or otherwise reset it before each spawn attempt, and keep the
same catch logic that checks code === 'QUOTA_EXCEEDED' and calls await
opts.sleep(10_000) up to three times per worker.
| for (let i = 0; i < opts.pollIterations; i += 1) { | ||
| const mountStatus = await harness.status({ localDir: opts.localDir }); | ||
| const agents = await harness.list(); | ||
| if (!mountStatus.running) break; | ||
| if (agents.every((a) => a.status !== 'running')) break; | ||
| } |
There was a problem hiding this comment.
Polling termination condition is too permissive.
Line 144 and Line 145 break on either condition. The procedure contract in this PR context is to poll until mount is not running and no agents are running; current logic can exit early.
🔧 Proposed fix
- if (!mountStatus.running) break;
- if (agents.every((a) => a.status !== 'running')) break;
+ if (!mountStatus.running && agents.every((a) => a.status !== 'running')) break;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for (let i = 0; i < opts.pollIterations; i += 1) { | |
| const mountStatus = await harness.status({ localDir: opts.localDir }); | |
| const agents = await harness.list(); | |
| if (!mountStatus.running) break; | |
| if (agents.every((a) => a.status !== 'running')) break; | |
| } | |
| for (let i = 0; i < opts.pollIterations; i += 1) { | |
| const mountStatus = await harness.status({ localDir: opts.localDir }); | |
| const agents = await harness.list(); | |
| if (!mountStatus.running && agents.every((a) => a.status !== 'running')) break; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/skills/spawn-cloud-swarm.test.ts` around lines 141 - 146, The polling
loop exits too early because it breaks when either the mount is stopped OR when
no agents are running; change the termination to require both conditions before
breaking. In the loop that calls harness.status() and harness.list() (variables
mountStatus and agents, loop using opts.pollIterations), replace the two
separate if-break checks with a single combined check that breaks only when
mountStatus.running is false AND agents.every((a) => a.status !== 'running') is
true so the poll continues until both the mount is not running and no agents
remain running.
| 4. The relaycast MCP surface exposes every tool this skill drives. Before | ||
| the prereq checks above, confirm `cloud.local-mount.ensure`, | ||
| `cloud.local-mount.status`, and `cloud.local-mount.stop` are all | ||
| registered on the MCP client. If any are missing, surface | ||
| `MCP_LOCAL_MOUNT_TOOLS_MISSING` and the verbatim remediation: "Upgrade | ||
| `@relaycast/mcp` to a build that includes `cloud.local-mount.*` (see | ||
| relaycast PR `feat/cloud-local-mount-tools`)." Then stop. |
There was a problem hiding this comment.
🟡 SKILL.md prereq 4 contradicts its own ordering — MCP tool check listed last but says "before the prereq checks above"
Prereq 4 (line 40) explicitly says "Before the prereq checks above, confirm cloud.local-mount.ensure, cloud.local-mount.status, and cloud.local-mount.stop are all registered on the MCP client." However, it is numbered as step 4 — after steps 1–3. This is directly contradictory: an agent following the numbered list will execute step 3 (which calls cloud.local-mount.status) before step 4 confirms that tool is even registered. If the MCP tools are missing, the agent gets a confusing tool-not-found error at step 3 instead of the intended MCP_LOCAL_MOUNT_TOOLS_MISSING remediation from step 4. Since this SKILL.md is an LLM-consumed instruction set, the contradictory ordering is a functional bug in the agent's procedure.
Prompt for agents
The MCP tool availability check (currently prereq 4) explicitly says it should run "before the prereq checks above" but is listed as the last numbered step. This contradicts the numbered ordering and causes step 3 to call cloud.local-mount.status before step 4 confirms it exists.
To fix: Move the MCP tool availability check to be prereq 1 (renumber current 1→2, 2→3, 3→4), or restructure the text so the "before" instruction is removed and the check naturally fits as a later step that doesn't depend on ordering. The simplest fix is to renumber it as step 0 or move it above the current step 1.
Files to change: skills/spawn-cloud-swarm/SKILL.md, specifically the Prereqs the Skill Enforces section (lines 22-49).
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
2 issues found across 12 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/spawn-cloud-swarm/SKILL.md">
<violation number="1" location="skills/spawn-cloud-swarm/SKILL.md:40">
P2: Prereq 4 says it should be checked "before the prereq checks above" but is listed last. Move this check to position 1 (or renumber) so tool-availability is verified before any MCP calls are attempted in the subsequent prereqs.</violation>
</file>
<file name="src/cli/commands/cloud.ts">
<violation number="1" location="src/cli/commands/cloud.ts:364">
P2: Unsafe cast `(skillErr as Error).message` will throw a `TypeError` if `installSkill` rejects with `null` or a non-object value, breaking the fail-open intent (connect should still succeed). Use the safe pattern already in this file: `skillErr instanceof Error ? skillErr.message : String(skillErr)`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Fix all with cubic | Re-trigger cubic
| `cloud.local-mount.status { localDir: CWD }` — if the MCP reports the | ||
| directory is not registered, surface `NEEDS_RELAYFILE_SETUP` and instruct | ||
| the user to run `relayfile setup --local-dir <CWD>`. Then stop. | ||
| 4. The relaycast MCP surface exposes every tool this skill drives. Before |
There was a problem hiding this comment.
P2: Prereq 4 says it should be checked "before the prereq checks above" but is listed last. Move this check to position 1 (or renumber) so tool-availability is verified before any MCP calls are attempted in the subsequent prereqs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/spawn-cloud-swarm/SKILL.md, line 40:
<comment>Prereq 4 says it should be checked "before the prereq checks above" but is listed last. Move this check to position 1 (or renumber) so tool-availability is verified before any MCP calls are attempted in the subsequent prereqs.</comment>
<file context>
@@ -0,0 +1,113 @@
+ `cloud.local-mount.status { localDir: CWD }` — if the MCP reports the
+ directory is not registered, surface `NEEDS_RELAYFILE_SETUP` and instruct
+ the user to run `relayfile setup --local-dir <CWD>`. Then stop.
+4. The relaycast MCP surface exposes every tool this skill drives. Before
+ the prereq checks above, confirm `cloud.local-mount.ensure`,
+ `cloud.local-mount.status`, and `cloud.local-mount.stop` are all
</file context>
| } | ||
| } catch (skillErr) { | ||
| deps.log( | ||
| `warning: failed to install spawn-cloud-swarm skill (${(skillErr as Error).message ?? 'unknown error'}); connect succeeded.` |
There was a problem hiding this comment.
P2: Unsafe cast (skillErr as Error).message will throw a TypeError if installSkill rejects with null or a non-object value, breaking the fail-open intent (connect should still succeed). Use the safe pattern already in this file: skillErr instanceof Error ? skillErr.message : String(skillErr).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/cli/commands/cloud.ts, line 364:
<comment>Unsafe cast `(skillErr as Error).message` will throw a `TypeError` if `installSkill` rejects with `null` or a non-object value, breaking the fail-open intent (connect should still succeed). Use the safe pattern already in this file: `skillErr instanceof Error ? skillErr.message : String(skillErr)`.</comment>
<file context>
@@ -330,6 +342,29 @@ export function registerCloudCommands(program: Command, overrides: Partial<Cloud
+ }
+ } catch (skillErr) {
+ deps.log(
+ `warning: failed to install spawn-cloud-swarm skill (${(skillErr as Error).message ?? 'unknown error'}); connect succeeded.`
+ );
+ }
</file context>
PR 7b (relay side) of `specs/mcp-cloud-spawn/pr-07b-spawn-cloud-swarm-skill-and-mount-tools.md`.
Ships the `spawn-cloud-swarm` skill into `~/.claude/skills/` when the user runs `agent-relay cloud connect claude`. The skill encapsulates the procedure for spawning multiple cloud agents on a shared relay channel: cloud.status preflight, ensure local relayfile mount, pick or create channel, spawn N agents, subscribe to the channel, and on stop kill the agents (by default leaving the mount running).
This spec is the deliberately-dual-repo exception in the parent spec — see specs/mcp-cloud-spawn/pr-07b-. The sibling `cloud.local-mount.` MCP tools need to ship as a separate relaycast PR; this PR alone is not useful until that lands.
Provenance: authored by a ricky workflow in worktree `/private/tmp/relay-spawn-cloud-swarm-skill`; commit + push salvaged manually after ricky's runtime-launch hung past the github-primitive shipping step (same pattern as cloud#688/#691/#692/#703 and relay#862/#863).
🤖 Generated with Claude Code