[aw-failures] Token-budget exhaustion (25M effective-tokens cap) recurring across 6+ scheduled workflows — 2026-05-29 02:00–07:32 UTC #35661

@github-actions

Problem statement

Between 2026-05-29 02:00 UTC and 07:32 UTC, at least 7 scheduled agentic workflow runs across 6 distinct workflows failed with the same Copilot CLI error:

CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)

This is a material expansion of the P1 "token-budget loop" cluster first surfaced in the parent report #35484 (which observed 2 affected workflows). The pattern has tripled in 24 hours and is now the dominant failure mode in the 6h window.

Affected workflows / runs (6h window)

| Workflow | Run ID | Time (UTC) | Symptom |
|---|---|---|---|
| PR Sous Chef | §26623257736 | 07:02 | 4 retries of CAPI 429 → action timed out @ 25m |
| PR Sous Chef | §26620005382 | 05:31 | effective_tokens_rate_limit_error set |
| Safe Output Health Monitor | §26620239212 | 05:38 | effective_tokens_rate_limit_error set |
| Step Name Alignment | §26619561645 | 05:18 | effective_tokens_rate_limit_error set (also tracked in #35644) |
| Copilot CLI Deep Research Agent | §26619051030 | 05:01 | effective_tokens_rate_limit_error set + 10 KB body limit hit on create_discussion |
| Go Logger Enhancement | §26618323959 | 04:39 | effective_tokens_rate_limit_error set |
| Daily Firewall Logs Collector and Reporter | §26615980789 | 03:22 | effective_tokens_rate_limit_error set |

Probable root cause

The Copilot CLI harness retries the agent up to 4 times with --continue after partial failures. When the prior turn already accumulated ≥20 M effective tokens (large MCP tool descriptions + workflow body + tool output history), each retry re-sends the full conversation and crosses the 25 M cap on the next request. The job then either:

  1. Loops through 4 retries each consuming ~1–2 minutes of 429 backoff (94 s total wait), then times out at the 25-minute step limit, or
  2. Exits non-zero on attempt 1 and the conclusion step marks the run as failure.
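The retry arithmetic described above can be illustrated with a minimal sketch. The cap value comes from the 429 message; the helper name and the idea of a fixed per-retry resend cost are assumptions for illustration, not the harness's actual accounting.

```python
from typing import Optional

CAP = 25_000_000  # effective-token cap quoted in the 429 message


def first_attempt_over_cap(prior_tokens: int, resend_cost: int,
                           max_attempts: int = 4) -> Optional[int]:
    """Return the 1-based retry attempt that pushes the total past CAP.

    Assumes (hypothetically) that every --continue retry adds `resend_cost`
    effective tokens because the accumulated conversation is sent again.
    Returns None when all attempts stay within the cap.
    """
    total = prior_tokens
    for attempt in range(1, max_attempts + 1):
        total += resend_cost
        if total > CAP:
            return attempt
    return None
```

With 20 M tokens already accumulated, even a modest resend cost leaves room for only one or two retries before the cap is crossed, which matches the observed pattern of retries failing in quick succession.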

Contributing factors:

  • MCP tool list payload is large (full descriptions of audit, audit-diff, logs, compile, codemod, etc. each ~1 KB).
  • Some workflow prompts (e.g. PR Sous Chef triaging 7 PRs) accumulate gh pr view JSON outputs across iterations.
  • Cache hits are high in absolute tokens (3.4 M cached on the PR Sous Chef failure) but cached input still counts against the effective-tokens cap.

Proposed remediation

  1. Cap per-workflow turn count — set explicit max-turns on the affected scheduled workflows (suggested 30 for triage workflows, 60 for investigative). Today none of these workflows declare max-turns.
  2. Reduce MCP tool surface area per workflow — most failing workflows have access to the full agenticworkflows tool catalog when they only need logs + audit. Use allow-tool lists to keep MCP description payload small.
  3. Trim conversation between retries — when the harness detects 429 effective-tokens on attempt N, it should pass --no-resume (or compact) instead of --continue for attempt N+1, so the retry starts from a smaller context window.
  4. Pre-emptive guard — emit a workflow warning when cumulative tokens cross 20 M (80 % of cap) so the agent can self-truncate with noop before failing on the next request.
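Items 3 and 4 above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RetryMode`, `should_warn`, and `next_retry_mode` are hypothetical names, and the "compact" mode stands in for whatever trimmed-context restart the harness would actually support.

```python
from enum import Enum

CAP = 25_000_000                   # effective-token cap from the 429 message
WARN_THRESHOLD = CAP * 80 // 100   # 20 M, i.e. 80 % of the cap (item 4)


class RetryMode(Enum):
    CONTINUE = "continue"  # resume the full conversation
    COMPACT = "compact"    # hypothetical: restart from a trimmed context


def should_warn(cumulative_tokens: float) -> bool:
    """True once cumulative effective tokens reach the warning threshold."""
    return cumulative_tokens >= WARN_THRESHOLD


def next_retry_mode(last_error: str) -> RetryMode:
    """Pick the retry mode for attempt N+1 from attempt N's error text."""
    if "Maximum effective tokens exceeded" in last_error:
        return RetryMode.COMPACT
    return RetryMode.CONTINUE
```

Matching on the error text keeps the sketch simple; a real harness would more likely branch on a structured error flag.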

Success criteria / verification

  • Over a 24h window after rollout, agentic workflow runs failing with effective_tokens_rate_limit_error drop below 5 % of completed runs (current rate: ~7 of ~25 scheduled completions in the 6h sample = ~28 %).
  • No single workflow contributes >1 token-cap failure in any 6h window.
  • PR Sous Chef, Safe Output Health Monitor, and Copilot CLI Deep Research Agent run to completion in ≥ 80 % of scheduled invocations.
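The first two criteria can be checked mechanically. The sketch below is a minimal illustration: the function names are hypothetical, the per-workflow counts are taken from the 6h table in this report, and the completed-run total of 25 is the approximate figure quoted above.

```python
from typing import Dict


def failure_rate(failed: int, completed: int) -> float:
    """Fraction of completed runs that failed on the token cap."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return failed / completed


def meets_criteria(failures_by_workflow: Dict[str, int], completed: int,
                   max_rate: float = 0.05, max_per_workflow: int = 1) -> bool:
    """Check the rate criterion and the per-workflow criterion for one window."""
    total_failed = sum(failures_by_workflow.values())
    rate_ok = failure_rate(total_failed, completed) < max_rate
    spread_ok = all(n <= max_per_workflow for n in failures_by_workflow.values())
    return rate_ok and spread_ok


# Token-cap failures per workflow in the 6h sample from this report.
sample = {
    "PR Sous Chef": 2,
    "Safe Output Health Monitor": 1,
    "Step Name Alignment": 1,
    "Copilot CLI Deep Research Agent": 1,
    "Go Logger Enhancement": 1,
    "Daily Firewall Logs Collector and Reporter": 1,
}
```

On the current sample both checks fail: 7 of ~25 runs is well above 5 %, and PR Sous Chef contributes two failures in the same window.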

Related issues

  • Parent: #35484
  • Related single-workflow tracker: #35644 (Step Name Alignment 80 % failure rate)
  • Related but distinct cause (not token-budget): #35441 (Daily Hippo Learn cache-memory git pack corruption — recurred at 07:37 today, §26624675283, confirming that tracker is still live).

Filed by [aw] Failure Investigator (6h) §26625870457.

Generated by 🔍 [aw] Failure Investigator (6h) · opus47 11.4M ·

  • expires on Jun 5, 2026, 8:18 AM UTC

Recurrence confirmed — 2026-05-30 17:43 UTC (still active)

The 25M effective-tokens cap fired again ~33h after this issue was opened, confirming the token-budget exhaustion pattern is still active and not yet remediated.

| Workflow | Run | Time (UTC) | Symptom |
|---|---|---|---|
| Linter Miner | §26690626184 | 2026-05-30 17:43 | agent / Execute GitHub Copilot CLI failed; effective_tokens_rate_limit_error set |

Exact 429 signature
429 Maximum effective tokens exceeded (25132364.10 / 25000000).

Run profile: 21.5m, 59 turns, 25.13M effective tokens — same single-run-crosses-the-cap shape described above (no --continue retry needed; one long run exceeded 25M on its own).
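For triage, the overshoot can be pulled straight out of the 429 signature. The sketch below assumes the message keeps the format shown in this issue; the pattern and the function name are illustrative, not part of any existing tool.

```python
import re
from typing import Optional, Tuple

# Pattern based on the signature quoted in this issue (assumed stable).
SIGNATURE = re.compile(
    r"429 Maximum effective tokens exceeded "
    r"\((?P<used>[\d.]+) / (?P<cap>[\d.]+)\)"
)


def parse_overshoot(line: str) -> Optional[Tuple[float, float, float]]:
    """Return (used, cap, used - cap), or None if the line does not match."""
    match = SIGNATURE.search(line)
    if match is None:
        return None
    used = float(match["used"])
    cap = float(match["cap"])
    return used, cap, used - cap
```

Applied to the Linter Miner signature, this yields an overshoot of roughly 132 K effective tokens, about 0.5 % over the cap.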

Keeping this issue open. (Investigated by the [aw] Failure Investigator 6h window ending 2026-05-30 ~19:10 UTC.)

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 4.1M ·


Recurrence — 2026-06-01 13:25 UTC (PR Sous Chef)

This failure mode recurred 3 days after the original report. Exactly one new failed run appears in the 2026-06-01 08:45–14:45 UTC window, and it matches this cluster.

| Workflow | Run | Time (UTC) | Effective tokens | Proximate failure |
|---|---|---|---|---|
| PR Sous Chef | §26757738602 | 13:25 | 23,683,854 (94.7% of 25M cap) | Execute GitHub Copilot CLI timed out @ 25m |

Key deltas vs. original report

  1. No CAPI 429 this time. Effective tokens peaked at 23.68M — just under the 25M cap — so the proximate failure was the 25-minute step timeout, not the effective_tokens rate-limit error. The agent was still serially iterating the PR queue (PR 36222 → 36225 → 36230) when wall-clock expired. Token pressure and the 25m timeout are two faces of the same root cause: serial whole-queue processing.
  2. High run-to-run variance. A sibling PR Sous Chef run §26757759212, triggered 23s later (13:25:30), succeeded with only 10,567,434 effective tokens; the failed run consumed ~2.24× as many over the same PR queue. This confirms the failure is load/variance-dependent, not a hard regression.
  3. Two near-simultaneous PR Sous Chef runs started within 23 seconds (13:25:07 and 13:25:30). Possible duplicate scheduling/trigger; low confidence, flagged for follow-up — concurrent runs over the same PR queue compound token/time pressure.
Evidence — agent step termination (run 26757738602)
2026-06-01T13:54:05.5700647Z ##[error]The action 'Execute GitHub Copilot CLI' has timed out after 25 minutes.

agent_usage.json:

{"input_tokens":3594933,"output_tokens":88094,"effective_tokens":23683854,"primary_model":"gpt-5.4-mini-2026-03-17"}

Job outcomes: agent → failure (27.4m); detection and safe_outputs → skipped; agent_output.json was empty (0 safe items emitted before the timeout).
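The headline figures in this update follow directly from agent_usage.json. A minimal sketch of the arithmetic, using the JSON quoted above and the sibling run's token count from the key deltas (the helper names are illustrative):

```python
import json

CAP = 25_000_000  # effective-token cap

# agent_usage.json from the failed run §26757738602, as quoted above.
USAGE_JSON = (
    '{"input_tokens":3594933,"output_tokens":88094,'
    '"effective_tokens":23683854,"primary_model":"gpt-5.4-mini-2026-03-17"}'
)
SIBLING_EFFECTIVE_TOKENS = 10_567_434  # successful sibling run §26757759212


def percent_of_cap(usage: dict) -> float:
    """Effective tokens as a percentage of the cap."""
    return usage["effective_tokens"] / CAP * 100


def ratio_to_sibling(usage: dict) -> float:
    """How many times more effective tokens the failed run used."""
    return usage["effective_tokens"] / SIBLING_EFFECTIVE_TOKENS


usage = json.loads(USAGE_JSON)
print(f"{percent_of_cap(usage):.1f}% of cap, "
      f"{ratio_to_sibling(usage):.2f}x the sibling run")
```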

Confidence & unknowns

Recommendation (unchanged, reinforced): bound PR Sous Chef's per-run work — process the PR queue in capped batches and/or lower the per-run effective-token budget so a single scheduled run cannot approach the 25M cap or the 25m step timeout.
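One way to bound per-run work is to slice the PR queue into fixed-size batches and stop after a maximum number of them. The sketch below is an assumption about how such a cap could look, not PR Sous Chef's actual logic; `capped_batches` and its parameters are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def capped_batches(pr_numbers: Iterable[int], batch_size: int,
                   max_batches: int) -> Iterator[List[int]]:
    """Yield at most `max_batches` batches of up to `batch_size` PRs each.

    PRs beyond the cap are left for the next scheduled run.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(pr_numbers)
    for _ in range(max_batches):
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# PR numbers observed in the failed run's queue.
queue = [36222, 36225, 36230]
for batch in capped_batches(queue, batch_size=2, max_batches=1):
    print(batch)
```

With `max_batches=1` only the first two PRs are handled in this run, so token and wall-clock cost scale with the batch cap rather than with the queue length.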

References: §26757738602, §26757759212

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 1.5M ·


2026-06-02 re-investigation (6h window 08:10–14:10 UTC) — STILL RECURRING, now cross-engine

Keep this open and prioritize the effective-token rail fix — the 25M cap is still the dominant failure mode 4 days after this issue was filed, and it now breaks the claude engine too, not just copilot. Of 6 agentic failures in the last 6h, 4 are effective-token over-consumption.

Fresh affected runs

| Workflow | Run | Engine | Turns | Effective tokens | Symptom |
|---|---|---|---|---|---|
| daily-experiment-report | §26810371152 | copilot | 42 | 25,191,391 | CAPIError: 429 Maximum effective tokens exceeded (25191390.60 / 25000000) → hard-rail, not retried → exit 1 |
| Package Specification Extractor | §26815252670 | copilot | 38 | 25,059,124 | same 429 hard-rail; also hasNumerousPermissionDenied=true (3 denied bash cmds) |
| Typist - Go Type Analysis | §26819864149 | claude | (0*) | 46,885,956 | claude-opus ran ~22m then agent job failed; effective tokens ~1.9× the cap |
| Daily AW Cross-Repo Compile Check | §26812510648 | claude | 72 | 22,454,280 | 66m run, 34 rate-limit + 33 timeout markers, killed near cap |

* Typist records Turns=0 because the claude-engine stdout parser miscounts turns when the run is killed mid-stream — the run actually produced opus-4-8 assistant turns. Tracking note: the turn-counter under-reports for killed claude runs.

Exact copilot hard-rail signature (both copilot runs)

copilot-harness effective-token rail
Last error: CAPIError: 429 Maximum effective tokens exceeded (25191390.60 / 25000000).
[copilot-harness] attempt 1 failed: ... isMaxEffectiveTokensExceededError=true ...
[copilot-harness] attempt 1: AWF effective-token hard rail hit — not retrying or continuing

Regression delta (audit-diff: failed §26815252670 vs healthy copilot §26815254572 Functional Pragmatist)

The failure is pure over-consumption, not infrastructure — has_anomalies: false, no firewall/MCP status changes.

audit-diff metrics
| Metric | Healthy | Failed | Δ |
|---|---|---|---|
| Effective tokens | 3,655,877 | 25,059,124 | +585% |
| Turns / requests | 7 | 38 | +31 |
| Tokens per turn | 522K | 659K | +26% |
| copilot API call volume | 15 | 84 | +460% |
| cache efficiency | 0 | 0 | |

Cache efficiency is 0 on both runs despite 2.29M cache-read tokens on the failed run — the effective-token formula is charging full weight for re-read context. High tokens-per-turn (~660K) + 38 turns indicates the agent re-reads large context every turn rather than narrowing.
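The deltas in the audit-diff table can be reproduced from the raw counts. A short sketch of that arithmetic, with illustrative helper names:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def tokens_per_turn(run: dict) -> float:
    """Average effective tokens spent per turn."""
    return run["effective_tokens"] / run["turns"]


# Raw counts from the audit-diff table above.
healthy = {"effective_tokens": 3_655_877, "turns": 7, "api_calls": 15}
failed = {"effective_tokens": 25_059_124, "turns": 38, "api_calls": 84}

token_delta = pct_change(healthy["effective_tokens"], failed["effective_tokens"])
per_turn_delta = pct_change(tokens_per_turn(healthy), tokens_per_turn(failed))
call_delta = pct_change(healthy["api_calls"], failed["api_calls"])
```

Total tokens grew far more than tokens per turn, so most of the increase comes from the higher turn count (7 to 38), with heavier turns adding to it.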

Recommended fixes (carry forward)

  1. Enforce a per-run turn/effective-token soft budget that triggers graceful summarize-and-exit before the 25M hard rail — today the rail aborts mid-task with no safe output.
  2. Audit context growth per turn for the repeat offenders (daily-experiment-report, Package Specification Extractor, Cross-Repo Compile Check) — 660K tokens/turn with 0 cache efficiency points at full-context re-reads.
  3. Fix the claude-engine turn counter so killed runs (Typist) don't report Turns=0 and slip past turn-based detectors.
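Until the turn counter is fixed, a detector can cross-check turns against token usage so that killed runs are not missed. This is a minimal sketch under assumptions: `RunStats` and both helpers are hypothetical, and the sample values are the Typist and Cross-Repo Compile Check rows from the table in this update.

```python
from dataclasses import dataclass

CAP = 25_000_000  # effective-token cap


@dataclass
class RunStats:
    workflow: str
    turns: int
    effective_tokens: int


def suspicious_turn_count(run: RunStats) -> bool:
    """Zero recorded turns alongside non-zero token usage suggests a miscount."""
    return run.turns == 0 and run.effective_tokens > 0


def over_cap(run: RunStats) -> bool:
    """Token-based check that does not rely on the turn counter."""
    return run.effective_tokens > CAP


typist = RunStats("Typist - Go Type Analysis", 0, 46_885_956)
compile_check = RunStats("Daily AW Cross-Repo Compile Check", 72, 22_454_280)
```

Keying detection on effective tokens first, and using turns only as a secondary signal, would have caught the Typist run even with its Turns=0 record.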

Correlation

Same root cause as the original report. Cross-engine spread (claude now affected) materially broadens scope. Distinct from #35780 (squid startup) and #36325 (zero-token early CLI exit — spends no tokens). Re-investigation parent run: §26825214886.

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 2M ·
