fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND #5651
Raman369AI wants to merge 1 commit into
When `load_skill_resource` returned `RESOURCE_NOT_FOUND`, the LLM treated it as a transient soft error and retried the same path indefinitely, producing runaway invocations and large API bills. Two complementary guards are added:

1. Code (`LoadSkillResourceTool.run_async`): an invocation-scoped retry guard tracks already-attempted `(skill, path)` pairs in `tool_context.state` under a `temp:_adk_skill_resource_failed_paths_<id>` key. The `temp:` prefix prevents persistence to durable session storage; the `invocation_id` suffix prevents leakage across invocations on in-memory session backends (where temp keys are not auto-cleared). A second call on the same path within the same invocation returns `RESOURCE_NOT_FOUND_FATAL` with an explicit stop instruction, giving the LLM an unambiguous terminal signal.
2. Prompt (`_DEFAULT_SKILL_SYSTEM_INSTRUCTION`): adds a rule prohibiting retrying the same path after any error, and a scope boundary clarifying that `load_skill_resource` is for skill-bundled files only, not for runtime user input (a common source of hallucinated paths).

Neither guard alone is sufficient: a code-only stop produces confusing downstream behavior when the LLM has no semantic understanding of why to stop; a prompt-only guard can be ignored under imperfect system prompts. Both layers are required for defense-in-depth.

Tests cover: repeat-failure escalation, distinct-path soft errors not escalating, cross-invocation isolation with shared session state, and the `temp:` prefix on tracking keys.
E2E Reproduction Evidence — Runner

This comment adds live end-to-end evidence as required by CONTRIBUTING.md § Manual End-to-End Tests (Runner).

Minimal Reproduction Agent

```python
# test_agent/agent.py
from google.adk.agents import Agent
from google.adk.skills import models
from google.adk.tools.skill_toolset import SkillToolset

greeting_skill = models.Skill(
    frontmatter=models.Frontmatter(
        name="document-classifier",
        description=(
            "Classifies the document supplied by the user based on the"
            " reference document"
        ),
    ),
    instructions=(
        "Use Document 1 as reference document and Document 2 as input."
        " Classify each document as python-code, non-python-code, or mixed."
        " Return a short comparison."
    ),
)

root_agent = Agent(
    model="gemini-3-flash-preview",
    name="root_agent",
    description="A repro agent for SkillToolset inline document handling.",
    instruction=("Always use the document as the reference and summarize it."),
    tools=[SkillToolset(skills=[greeting_skill])],
)
```

Run command: `adk web`

Trigger prompt (any vague document reference is sufficient):

What happens (unpatched)

The skill instructions reference "Document 1" and "Document 2" as if they are bundled skill resources, but no files are attached to the skill. From the LLM's perspective the resulting `RESOURCE_NOT_FOUND` looks transient and recoverable, so it retries the same path.

ADK Web trace —
(⚡ = `RESOURCE_NOT_FOUND` error returned; ✓ = tool call dispatched successfully but returned the error string — the alternating pattern reflects the model immediately re-invoking after each soft failure.)

Server log — loop persisting through and beyond CTRL+C:

The loop continued firing new LLM requests for ~20 seconds after the first CTRL+C, requiring repeated interrupt signals to force-quit. In-flight API calls are not cancellable — every iteration that completes before the process dies is billed.

Why ambiguous prompts make this a framework issue, not a user error

This reproduction agent is not contrived — skill instructions that reference documents by natural-language names ("Document 1", "reference document") are a normal and expected authoring pattern. The ambiguity is unavoidable: skill authors do not know at write-time whether end-users will supply input inline or whether the model will infer a resource load. The framework must be defensive here because:
The defense-in-depth approach in this PR —
Fixes #5652
Summary
Closes a latent defect in `SkillToolset` that lets the LLM enter an unbounded retry loop when `load_skill_resource` returns `RESOURCE_NOT_FOUND`. Because the only existing backstop is `RunConfig.max_llm_calls` (default 500), a single hallucinated path can quietly burn the entire per-invocation call budget on the same failing tool call before the framework intervenes — and `max_llm_calls` is a global cap on legitimate reasoning, not a defense against this specific failure mode.

This fix adds invocation-scoped termination to the tool itself, plus a complementary system-prompt rule, so the framework no longer relies on a perfect upstream prompt to avoid unbounded loops on a known error path.
Why this matters
This is a structural problem, not an edge case:
- The `load_skill` response intentionally omits a manifest of available files — that is the agentskills.io progressive-disclosure contract, and it is correct for token economy. But it means the LLM must infer paths from prose instructions inside `SKILL.md`. Inferred paths are routinely wrong, even with strong models.
- `RESOURCE_NOT_FOUND` is returned as a structured soft string. From the model's perspective it looks transient and recoverable, exactly like a flaky network error — so it retries the same path. There is no terminal signal in the current implementation.
- Nothing in `SkillToolset` distinguishes "first miss" from "500th miss". Every retry is treated identically. The loop terminates only when `max_llm_calls` (default 500) is hit, by which point the entire budget has been spent on one wrong path.
- The model conflates skill-bundled resources (loaded via `load_skill_resource`) and runtime user inputs (e.g., a PDF the user is processing). A model that has just been asked to analyze a runtime document will sometimes route that document through `load_skill_resource`, hit `RESOURCE_NOT_FOUND`, and loop on it — even though the file was never a skill resource to begin with.

The combination — no manifest, soft error, no terminal signal, no scope boundary — means the loop is reachable by ordinary use of the feature, not just adversarial inputs. Any production user with imperfect prompts is exposed.
What changed
Two layers in one file (`src/google/adk/tools/skill_toolset.py`), plus tests:

1. Code: invocation-scoped retry guard in `LoadSkillResourceTool.run_async`

Failed `(skill, path)` pairs are tracked on `tool_context.state` under the key `temp:_adk_skill_resource_failed_paths_<invocation_id>`:

- The `temp:` prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.
- The `<invocation_id>` suffix ensures correctness on in-memory session backends as well, where `temp:` keys are added to `session.state` and are not auto-cleared between invocations. Without the suffix, a path that legitimately failed in invocation A would spuriously hit the fatal path on its first attempt in invocation B.

Behavior:

- First failure for a given `(skill, path)` within an invocation → returns `RESOURCE_NOT_FOUND` (unchanged).
- Repeat failure on the same pair → returns the `RESOURCE_NOT_FOUND_FATAL` error code, with an explicit "do not retry — report the error and stop" message in the error string. This gives the LLM an unambiguous terminal signal.

The guard is bounded (one entry per unique missing path), invocation-scoped, and adds no overhead on the success path.
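The guard can be sketched as follows. This is a minimal plain-dict model, not the actual implementation: the helper names and message wording are illustrative, and only the error codes, the key format, and the `temp:` convention come from this PR.

```python
# Sketch of the invocation-scoped retry guard (illustrative names; the real
# code lives in LoadSkillResourceTool.run_async and uses tool_context.state).

def _failed_paths_key(invocation_id: str) -> str:
  # "temp:" prefix: trimmed from the persisted event delta, so the value
  # never reaches durable session storage.
  # invocation_id suffix: prevents leakage across invocations on in-memory
  # backends, where "temp:" keys are not auto-cleared.
  return f"temp:_adk_skill_resource_failed_paths_{invocation_id}"


def guarded_not_found(
    state: dict, invocation_id: str, skill: str, path: str
) -> str:
  """Returns the soft error on a first miss, the fatal error on a repeat."""
  key = _failed_paths_key(invocation_id)
  failed = state.setdefault(key, set())
  if (skill, path) in failed:
    return (
        "RESOURCE_NOT_FOUND_FATAL: this path already failed in this"
        " invocation. Do not retry; report the error and stop."
    )
  failed.add((skill, path))
  return f"RESOURCE_NOT_FOUND: no resource '{path}' in skill '{skill}'."
```

Because the tracking set holds one entry per unique missing path, the sketch shows why the guard is bounded and why distinct paths each get one soft error before escalating.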
2. Prompt: two additions to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION`

- A no-retry rule: if `load_skill_resource` returns any error, do not retry the same path. Report the error to the user and stop.
- A scope boundary: `load_skill_resource` is only for skill-bundled files in `references/`, `assets/`, or `scripts/` — not for runtime user input. This addresses the confounding failure mode above.

Why both layers

Defense-in-depth. Code-only termination produces confusing downstream behavior when the LLM has no semantic understanding of why the tool stopped responding the way it expects. Prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override or contradict. Together they are robust.
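For concreteness, the two prompt additions might read roughly like this. The wording below is hypothetical; the exact strings live in `_DEFAULT_SKILL_SYSTEM_INSTRUCTION` in `src/google/adk/tools/skill_toolset.py` and may differ.

```python
# Hypothetical wording for the two added prompt rules (not the actual
# constants from the ADK source).

NO_RETRY_RULE = (
    "If load_skill_resource returns any error, do not retry the same path."
    " Report the error to the user and stop."
)

SCOPE_BOUNDARY_RULE = (
    "load_skill_resource reads only files bundled with a skill under"
    " references/, assets/, or scripts/. Never use it for runtime user"
    " input such as documents supplied in the conversation."
)
```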
Considered and rejected
- Relying on `max_llm_calls`: a global cap on legitimate reasoning, not a targeted defense; the loop still burns the full budget first.
- An `after_tool_callback` workaround: every user of `SkillToolset` would have to implement it themselves; the framework still ships with the loop.
- Adding an `available_resources` manifest to the L2 `load_skill` response: conflicts with the agentskills.io progressive-disclosure contract and its token economy.
- A `list_skill_resources` tool: the same progressive-disclosure concern.

Test plan
New tests in `tests/unittests/tools/test_skill_toolset.py`:

- `test_load_resource_repeated_failure_escalates_to_fatal` — second call on the same missing path returns `RESOURCE_NOT_FOUND_FATAL` with an explicit stop instruction.
- `test_load_resource_different_paths_each_soft_fail` — distinct missing paths each return the soft error (no over-eager cross-path escalation).
- `test_load_resource_failures_isolated_per_invocation` — proves the `<invocation_id>` scoping: failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.
- `test_load_resource_failed_paths_use_temp_prefix` — proves the `temp:` prefix invariant so tracking keys are never persisted to durable storage.

Verification:
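The last invariant can be modeled in a few lines. This is a simplified sketch of ADK's persistence rule, with an illustrative helper name; the key format comes from this PR's description.

```python
# Simplified model of the temp:-prefix invariant behind
# test_load_resource_failed_paths_use_temp_prefix: keys with the "temp:"
# prefix are dropped before the state delta reaches durable storage.

TEMP_PREFIX = "temp:"


def tracking_key(invocation_id: str) -> str:
  # Key format from the PR; one key per invocation.
  return f"{TEMP_PREFIX}_adk_skill_resource_failed_paths_{invocation_id}"


def persisted_delta(state: dict) -> dict:
  # Sketch of the persistence filter: anything temp:-prefixed never makes
  # it into the durable session record.
  return {k: v for k, v in state.items() if not k.startswith(TEMP_PREFIX)}
```

Under this model, any entry written under `tracking_key(...)` is filtered out of the persisted delta, which is exactly what the test asserts.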
- `pyink --check` clean on both modified files.
- `isort --check-only` clean on both modified files.
- `mypy src/google/adk/tools/skill_toolset.py` reports the same 17 pre-existing errors as `main`; zero new errors introduced.

Backwards compatibility
- A first failure still returns the same `RESOURCE_NOT_FOUND` error code with the same error string. Existing callers and tests that key off this code see no difference.
- The new `RESOURCE_NOT_FOUND_FATAL` code is purely additive.
- The tracking key uses the `temp:` prefix and is therefore guaranteed not to leak into persisted session storage.