Problem statement
Between 2026-05-29 02:00 UTC and 07:32 UTC, at least 7 scheduled agentic workflow runs across 6 distinct workflows failed with the same Copilot CLI error:
CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)
This is a material expansion of the P1 "token-budget loop" cluster first surfaced in the parent report #35484 (which observed 2 affected workflows). The pattern has tripled in 24 hours and is now the dominant failure mode in the 6h window.
Affected workflows / runs (6h window)
| Workflow | Run ID | Time (UTC) | Symptom |
| --- | --- | --- | --- |
| PR Sous Chef | §26623257736 | 07:02 | 4 retries of CAPI 429 → action timed out @ 25m |
| PR Sous Chef | §26620005382 | 05:31 | `effective_tokens_rate_limit_error` set |
| Safe Output Health Monitor | §26620239212 | 05:38 | `effective_tokens_rate_limit_error` set |
| Step Name Alignment | §26619561645 | 05:18 | `effective_tokens_rate_limit_error` set (also tracked in #35644) |
| Copilot CLI Deep Research Agent | §26619051030 | 05:01 | `effective_tokens_rate_limit_error` set + 10 KB body limit hit on `create_discussion` |
| Go Logger Enhancement | §26618323959 | 04:39 | `effective_tokens_rate_limit_error` set |
| Daily Firewall Logs Collector and Reporter | §26615980789 | 03:22 | `effective_tokens_rate_limit_error` set |
Probable root cause
The Copilot CLI harness retries the agent up to 4 times with --continue after partial failures. When the prior turn already accumulated ≥20 M effective tokens (large MCP tool descriptions + workflow body + tool output history), each retry re-sends the full conversation and crosses the 25 M cap on the next request. The job then either:
- Loops through 4 retries each consuming ~1–2 minutes of 429 backoff (94 s total wait), then times out at the 25-minute step limit, or
- Exits non-zero on attempt 1 and the conclusion step marks the run as failure.
Contributing factors:
- MCP tool list payload is large (full descriptions of `audit`, `audit-diff`, `logs`, `compile`, `codemod`, etc., each ~1 KB).
- Some workflow prompts (e.g. PR Sous Chef triaging 7 PRs) accumulate `gh pr view` JSON outputs across iterations.
- Cache hits are high in absolute tokens (3.4 M cached on the PR Sous Chef failure) but cached input still counts against the effective-tokens cap.
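The retry behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions: the function name `attempts_until_cap` and the idea that each retry adds a fixed `context_size` in full to the running total are hypothetical, not taken from the harness. Only the 25M cap, the 4-retry limit, and the 20M starting point come from the report.

```python
CAP = 25_000_000  # effective-token cap quoted in the 429 message


def attempts_until_cap(accumulated, context_size, max_attempts=4, cap=CAP):
    """Return the 1-based retry attempt that pushes the run past the cap, or None.

    Assumption (hypothetical): every retry re-sends `context_size` tokens and
    all of them count toward the cumulative effective-token total.
    """
    total = accumulated
    for attempt in range(1, max_attempts + 1):
        total += context_size  # the whole conversation is sent again
        if total > cap:
            return attempt
    return None


if __name__ == "__main__":
    # A run already at 20M that re-sends a 3M-token conversation per retry.
    print(attempts_until_cap(20_000_000, 3_000_000))
```

Under this model a run that has already accumulated 20M tokens has room for only one or two large retries, which matches the observation that retries with `--continue` rarely recover.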
Proposed remediation
- Cap per-workflow turn count: set explicit `max-turns` on the affected scheduled workflows (suggested 30 for triage workflows, 60 for investigative). Today none of these workflows declare `max-turns`.
- Reduce MCP tool surface area per workflow: most failing workflows have access to the full `agenticworkflows` tool catalog when they only need `logs` + `audit`. Use `allow-tool` lists to keep MCP description payload small.
- Trim conversation between retries: when the harness detects 429 effective-tokens on attempt N, it should pass `--no-resume` (or compact) instead of `--continue` for attempt N+1, so the retry starts from a smaller context window.
- Pre-emptive guard: emit a workflow warning when cumulative tokens cross 20 M (80 % of cap) so the agent can self-truncate with `noop` before failing on the next request.
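The pre-emptive guard in the last item could look roughly like the following. This is a minimal sketch, not the harness's actual logic: `budget_status` and its return labels are hypothetical names, and only the 25M cap and the 80 % warning level come from the proposal above.

```python
def budget_status(cumulative_tokens, cap=25_000_000, warn_percent=80):
    """Classify a run's cumulative effective tokens against the cap.

    Returns "exceeded" above the cap, "warn" at or above warn_percent of the
    cap, and "ok" otherwise. Integer arithmetic avoids float rounding at the
    threshold.
    """
    if cumulative_tokens > cap:
        return "exceeded"
    if cumulative_tokens * 100 >= cap * warn_percent:
        return "warn"
    return "ok"


if __name__ == "__main__":
    for tokens in (19_999_999, 20_000_000, 23_683_854, 25_011_516):
        print(tokens, budget_status(tokens))
```

A caller would check the status after each turn and switch to a wrap-up path on "warn" instead of issuing another full-context request.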
Success criteria / verification
- Over a 24h window after rollout, agentic workflow runs failing with `effective_tokens_rate_limit_error` drop below 5 % of completed runs (current rate: ~7 of ~25 scheduled completions in the 6h sample = ~28 %).
- No single workflow contributes >1 token-cap failure in any 6h window.
- PR Sous Chef, Safe Output Health Monitor, and Copilot CLI Deep Research run to completion in ≥ 80 % of scheduled invocations.
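The first two criteria are simple enough to check mechanically. The sketch below is one way to do it, assuming the investigator can produce a list of failed workflow names per window; the helper names `failure_rate` and `workflows_over_limit` are hypothetical. The sample data is the 6h window from the table above.

```python
from collections import Counter


def failure_rate(failed, completed):
    """Fraction of completed runs that failed with the token-cap error."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return failed / completed


def workflows_over_limit(failed_workflows, limit=1):
    """Names of workflows with more than `limit` token-cap failures in a window."""
    counts = Counter(failed_workflows)
    return sorted(name for name, n in counts.items() if n > limit)


# Token-cap failures in the original 6h sample, one entry per failed run.
SAMPLE = [
    "PR Sous Chef",
    "PR Sous Chef",
    "Safe Output Health Monitor",
    "Step Name Alignment",
    "Copilot CLI Deep Research Agent",
    "Go Logger Enhancement",
    "Daily Firewall Logs Collector and Reporter",
]

if __name__ == "__main__":
    rate = failure_rate(len(SAMPLE), 25)
    print(f"rate={rate:.0%}, target met: {rate < 0.05}")
    print("over per-workflow limit:", workflows_over_limit(SAMPLE))
```

On the original sample both checks fail: the rate is well above 5 % and PR Sous Chef has two failures in one window.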
Related issues
- Parent: #35484
- Related single-workflow tracker: #35644 (Step Name Alignment 80 % failure rate)
- Related but distinct cause (not token-budget): #35441 (Daily Hippo Learn cache-memory git pack corruption — recurred at 07:37 today, §26624675283, confirming that tracker is still live).
Filed by [aw] Failure Investigator (6h) §26625870457.
Generated by 🔍 [aw] Failure Investigator (6h) · opus47 11.4M · ◷
Recurrence confirmed — 2026-05-30 17:43 UTC (still active)
The 25M effective-tokens cap fired again ~33h after this issue was opened, confirming the token-budget exhaustion pattern is still active and not yet remediated.
| Workflow | Run | Time (UTC) | Symptom |
| --- | --- | --- | --- |
| Linter Miner | §26690626184 | 2026-05-30 17:43 | `agent` → `Execute GitHub Copilot CLI` failed; `effective_tokens_rate_limit_error` set |
Exact 429 signature
429 Maximum effective tokens exceeded (25132364.10 / 25000000).
Run profile: 21.5m, 59 turns, 25.13M effective tokens — same single-run-crosses-the-cap shape described above (no --continue retry needed; one long run exceeded 25M on its own).
Keeping this issue open. (Investigated by the [aw] Failure Investigator 6h window ending 2026-05-30 ~19:10 UTC.)
Generated by 🔍 [aw] Failure Investigator (6h) · opus48 4.1M · ◷
Recurrence — 2026-06-01 13:25 UTC (PR Sous Chef)
This failure mode recurred 3 days after the original report. Exactly one new failed run appears in the 2026-06-01 08:45–14:45 UTC window, and it matches this cluster.
| Workflow | Run | Time (UTC) | Effective tokens | Proximate failure |
| --- | --- | --- | --- | --- |
| PR Sous Chef | §26757738602 | 13:25 | 23,683,854 (94.7% of 25M cap) | `Execute GitHub Copilot CLI` timed out @ 25m |
Key deltas vs. original report
- No CAPI 429 this time. Effective tokens peaked at 23.68M, just under the 25M cap, so the proximate failure was the 25-minute step timeout, not the `effective_tokens` rate-limit error. The agent was still serially iterating the PR queue (PR 36222 → 36225 → 36230) when wall-clock expired. Token pressure and the 25m timeout are two faces of the same root cause: serial whole-queue processing.
- High run-to-run variance. A sibling PR Sous Chef run §26757759212, triggered 23s later (13:25:30), succeeded with only 10,567,434 effective tokens; the failed run consumed ~2.24× as many over the same PR queue. This indicates the failure is load/variance-dependent, not a hard regression.
- Two near-simultaneous PR Sous Chef runs started within 23 seconds (13:25:07 and 13:25:30). Possible duplicate scheduling/trigger; low confidence, flagged for follow-up — concurrent runs over the same PR queue compound token/time pressure.
Evidence — agent step termination (run 26757738602)
2026-06-01T13:54:05.5700647Z ##[error]The action 'Execute GitHub Copilot CLI' has timed out after 25 minutes.
agent_usage.json:
{"input_tokens":3594933,"output_tokens":88094,"effective_tokens":23683854,"primary_model":"gpt-5.4-mini-2026-03-17"}
Job outcomes: agent → failure (27.4m); detection and safe_outputs → skipped; agent_output.json was empty (0 safe items emitted before the timeout).
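Confidence & unknowns
- #35780 (squid/claude startup) and #35984 (Contribution Check safe_outputs) did not execute in this window, so neither could be confirmed fixed or stale from fresh evidence; both remain open.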
Recommendation (unchanged, reinforced): bound PR Sous Chef's per-run work — process the PR queue in capped batches and/or lower the per-run effective-token budget so a single scheduled run cannot approach the 25M cap or the 25m step timeout.
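As an illustration of the capped-batch idea, a run could take only the head of the PR queue and leave the rest for the next scheduled invocation. This is a sketch only: `next_batch` and the batch size are hypothetical, and how PR Sous Chef actually obtains its queue is not shown in this issue. The PR numbers are the ones seen in the failed run.

```python
def next_batch(queue, batch_size):
    """Split a PR queue into (work for this run, work deferred to later runs)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return queue[:batch_size], queue[batch_size:]


if __name__ == "__main__":
    this_run, deferred = next_batch([36222, 36225, 36230], batch_size=2)
    print("process now:", this_run)
    print("leave for next run:", deferred)
```

Bounding the batch bounds both token growth and wall-clock time, since the two scale with the number of PRs handled in one conversation.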
References: §26757738602, §26757759212
Generated by 🔍 [aw] Failure Investigator (6h) · opus48 1.5M · ◷
2026-06-02 re-investigation (6h window 08:10–14:10 UTC) — STILL RECURRING, now cross-engine
Keep this open and prioritize the effective-token rail fix — the 25M cap is still the dominant failure mode 4 days after this issue was filed, and it now breaks the claude engine too, not just copilot. Of 6 agentic failures in the last 6h, 4 are effective-token over-consumption.
Fresh affected runs
| Workflow | Run | Engine | Turns | Effective tokens | Symptom |
| --- | --- | --- | --- | --- | --- |
| daily-experiment-report | §26810371152 | copilot | 42 | 25,191,391 | `CAPIError: 429 Maximum effective tokens exceeded (25191390.60 / 25000000)` → hard-rail, not retried → exit 1 |
| Package Specification Extractor | §26815252670 | copilot | 38 | 25,059,124 | same 429 hard-rail; also `hasNumerousPermissionDenied=true` (3 denied bash cmds) |
| Typist - Go Type Analysis | §26819864149 | claude | (0*) | 46,885,956 | claude-opus ran ~22m then `agent` job failed; effective tokens ~1.9× the cap |
| Daily AW Cross-Repo Compile Check | §26812510648 | claude | 72 | 22,454,280 | 66m run, 34 rate-limit + 33 timeout markers, killed near cap |
* Typist records Turns=0 because the claude-engine stdout parser miscounts turns when the run is killed mid-stream — the run actually produced opus-4-8 assistant turns. Tracking note: the turn-counter under-reports for killed claude runs.
Exact copilot hard-rail signature (both copilot runs)
copilot-harness effective-token rail
Last error: CAPIError: 429 Maximum effective tokens exceeded (25191390.60 / 25000000).
[copilot-harness] attempt 1 failed: ... isMaxEffectiveTokensExceededError=true ...
[copilot-harness] attempt 1: AWF effective-token hard rail hit — not retrying or continuing
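For anyone triaging these logs in bulk, the used/cap pair can be pulled out of the signature with a regular expression. The sketch below assumes the message format stays exactly as quoted in this issue; `parse_cap_error` is a hypothetical helper, not part of the harness.

```python
import re

# Matches the numeric pair in "429 Maximum effective tokens exceeded (used / cap)".
CAP_ERROR = re.compile(
    r"429 Maximum effective tokens exceeded \((?P<used>[\d.]+) / (?P<cap>[\d.]+)\)"
)


def parse_cap_error(line):
    """Return (used, cap) as floats if the line carries the cap error, else None."""
    match = CAP_ERROR.search(line)
    if match is None:
        return None
    return float(match.group("used")), float(match.group("cap"))


if __name__ == "__main__":
    line = ("Last error: CAPIError: 429 Maximum effective tokens exceeded "
            "(25191390.60 / 25000000).")
    used, cap = parse_cap_error(line)
    print(f"over cap by {used - cap:.2f} tokens")
```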
Regression delta (audit-diff: failed §26815252670 vs healthy copilot §26815254572 Functional Pragmatist)
The failure is pure over-consumption, not infrastructure — has_anomalies: false, no firewall/MCP status changes.
audit-diff metrics
| Metric | Healthy | Failed | Δ |
| --- | --- | --- | --- |
| Effective tokens | 3,655,877 | 25,059,124 | +585% |
| Turns / requests | 7 | 38 | +31 |
| Tokens per turn | 522K | 659K | +26% |
| copilot API call volume | 15 | 84 | +460% |
| cache efficiency | 0 | 0 | n/a |
Cache efficiency is 0 on both runs despite 2.29M cache-read tokens on the failed run — the effective-token formula is charging full weight for re-read context. High tokens-per-turn (~660K) + 38 turns indicates the agent re-reads large context every turn rather than narrowing.
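The derived rows in the audit-diff metrics can be reproduced from the raw counts, which is a quick sanity check when comparing other run pairs. This is a small sketch with hypothetical helper names; it assumes "tokens per turn" is simply effective tokens divided by turns, which is consistent with the figures reported here.

```python
def tokens_per_turn(effective_tokens, turns):
    """Average effective tokens charged per turn."""
    return effective_tokens / turns


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    healthy = tokens_per_turn(3_655_877, 7)
    failed = tokens_per_turn(25_059_124, 38)
    print(f"tokens/turn: {healthy:,.0f} -> {failed:,.0f}")
    print(f"tokens/turn change: {percent_change(healthy, failed):+.0f}%")
    print(f"effective tokens change: {percent_change(3_655_877, 25_059_124):+.0f}%")
    print(f"API call change: {percent_change(15, 84):+.0f}%")
```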
Recommended fixes (carry forward)
- Enforce a per-run turn/effective-token soft budget that triggers graceful summarize-and-exit before the 25M hard rail — today the rail aborts mid-task with no safe output.
- Audit context growth per turn for the repeat offenders (daily-experiment-report, Package Specification Extractor, Cross-Repo Compile Check) — 660K tokens/turn with 0 cache efficiency points at full-context re-reads.
- Fix the claude-engine turn counter so killed runs (Typist) don't report `Turns=0` and slip past turn-based detectors.
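Until the turn counter is fixed, a detector keyed on effective tokens alone would still catch runs like Typist. The sketch below is one possible shape, not an existing detector: `flag_token_pressure` is a hypothetical name and the 80 % "near cap" level is borrowed from the pre-emptive guard proposed earlier in this issue. The run data is from the table of fresh affected runs.

```python
CAP = 25_000_000

# (workflow, reported turns, effective tokens) from the 2026-06-02 window.
RUNS = [
    ("daily-experiment-report", 42, 25_191_391),
    ("Package Specification Extractor", 38, 25_059_124),
    ("Typist - Go Type Analysis", 0, 46_885_956),
    ("Daily AW Cross-Repo Compile Check", 72, 22_454_280),
]


def flag_token_pressure(runs, cap=CAP, near_percent=80):
    """Flag runs by effective tokens only; the reported turn count is ignored."""
    flagged = []
    for name, _turns, tokens in runs:
        if tokens > cap:
            flagged.append((name, "over cap"))
        elif tokens * 100 >= cap * near_percent:
            flagged.append((name, "near cap"))
    return flagged


if __name__ == "__main__":
    for name, label in flag_token_pressure(RUNS):
        print(f"{label}: {name}")
```

Because the turn count never enters the decision, a killed run that reports zero turns is flagged the same way as any other over-cap run.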
Correlation
Same root cause as the original report. Cross-engine spread (claude now affected) materially broadens scope. Distinct from #35780 (squid startup) and #36325 (zero-token early CLI exit — spends no tokens). Re-investigation parent run: §26825214886.
Generated by 🔍 [aw] Failure Investigator (6h) · opus48 2M · ◷