
Proposal: Add built-in tool_parameter_match evaluator #5643

@fabrizioamort

Description


🔴 Required Information

Is your feature request related to a specific problem?

ADK does not currently expose a dedicated built-in metric for tool argument quality. Existing trajectory metrics tell you whether a tool was called, but not whether the agent supplied the right parameters. This hides important failure modes:

  • correct tool name with incorrect required arguments
  • partial argument correctness — values that are close but not exact
  • repeated calls to the same tool, which require a stable one-to-one alignment between expected and actual invocations before argument-level scoring is meaningful

For regression tracking during agent development, "the right tool was called" is necessary but not sufficient. Without argument-level scoring, agents that systematically miscall a correctly-chosen tool produce no signal in built-in metrics.

Describe the Solution You'd Like

A new built-in evaluator named tool_parameter_match that scores the quality of tool-call arguments after deterministic alignment between expected and actual invocations.

  • Metric name: tool_parameter_match
  • Criterion type: ToolParameterMatchCriterion(BaseCriterion)
  • Match modes: name_only, name_and_args, name_and_required_args (alignment behavior, mirroring trajectory scoring)
  • Argument strategies: exact, casefold_exact, numeric, contains (see the strategy sketch after this list)
  • Per-argument strategy override: optional per_arg_strategies dict
  • Numeric tolerance: optional numeric_tolerance for the numeric strategy
  • Returns NOT_EVALUATED when reference invocations are unavailable
  • Returns per-invocation NOT_EVALUATED when an invocation has zero expected tool calls
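
To make the strategy semantics concrete, here is a minimal sketch of how the four argument strategies could behave. The helper name match_argument and the exact normalization rules are illustrative, not a proposed final API:

# Hypothetical helper; name and normalization rules are illustrative only.
def match_argument(
    expected,
    actual,
    strategy: str = "exact",
    numeric_tolerance: float = 0.0,
) -> float:
    if strategy == "exact":
        return 1.0 if expected == actual else 0.0
    if strategy == "casefold_exact":
        return 1.0 if str(expected).casefold() == str(actual).casefold() else 0.0
    if strategy == "numeric":
        try:
            return 1.0 if abs(float(expected) - float(actual)) <= numeric_tolerance else 0.0
        except (TypeError, ValueError):
            return 0.0
    if strategy == "contains":
        return 1.0 if str(expected) in str(actual) else 0.0
    raise ValueError(f"Unknown strategy: {strategy}")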

Per-call scoring (for each matched expected tool call):

  • score the expected argument keys only
  • average the per-argument match scores
  • if the expected argument dict is empty, the tool-call score is 1.0

For unmatched expected calls: score 0.0.

Case score is the mean of evaluated per-invocation scores, excluding NOT_EVALUATED invocations.
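
A rough sketch of the scoring rules above, reusing the match_argument sketch and assuming alignment has already paired expected calls with actual calls; score_tool_call and score_case are hypothetical names, and NOT_EVALUATED is represented as None here:

# Hypothetical helpers illustrating the scoring rules described above.
def score_tool_call(expected_args: dict, actual_args: dict, criterion) -> float:
    """Average per-argument scores over the expected keys only."""
    if not expected_args:
        return 1.0  # empty expected argument dict scores 1.0
    scores = []
    for key, expected_value in expected_args.items():
        strategy = (criterion.per_arg_strategies or {}).get(key, criterion.default_strategy)
        scores.append(
            match_argument(
                expected_value,
                actual_args.get(key),
                strategy=strategy,
                numeric_tolerance=criterion.numeric_tolerance,
            )
        )
    return sum(scores) / len(scores)

def score_case(per_invocation_scores: list[float | None]) -> float | None:
    """Mean of evaluated invocation scores; None entries stand for NOT_EVALUATED."""
    evaluated = [s for s in per_invocation_scores if s is not None]
    return sum(evaluated) / len(evaluated) if evaluated else None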

Impact on your work

I am building an evaluation harness for ADK-based agents and need a deterministic, regression-friendly metric for argument-level correctness. Without one, regressions where the agent picks the right tool but supplies wrong or partial arguments are invisible to built-in scoring, which slows down iterative improvement.

No specific timeline — but this would unblock more reliable eval-driven development for any team using ADK.

Willingness to contribute

Yes. I am willing to implement this and submit a focused PR with unit tests, following the contribution guidelines (CLA signed).


🟡 Recommended Information

Describe Alternatives You've Considered

  • Custom metric function via custom_metrics: works today but requires per-project boilerplate, is not discoverable, and cannot be referenced by name in eval-set configs the way built-in metrics can.
  • Embedding argument matching inside tool_trajectory_in_order: would conflate trajectory shape with argument quality and break existing eval sets that rely on the trajectory metric's current binary semantics.

Proposed API / Implementation

# eval_metrics.py
from typing import Literal

ArgumentStrategy = Literal["exact", "casefold_exact", "numeric", "contains"]

class ToolParameterMatchCriterion(BaseCriterion):
    # Alignment behavior, mirroring the trajectory metrics.
    match_mode: Literal["name_only", "name_and_args", "name_and_required_args"] = "name_and_required_args"
    # Applied to every expected argument unless overridden per argument.
    default_strategy: ArgumentStrategy = "exact"
    per_arg_strategies: dict[str, ArgumentStrategy] | None = None
    # Absolute tolerance used by the "numeric" strategy.
    numeric_tolerance: float = 0.0
    ordered: bool = True

# parameter_match_evaluator.py
from typing import ClassVar

class ToolParameterMatchEvaluator(Evaluator):
    criterion_type: ClassVar = ToolParameterMatchCriterion

    async def evaluate_invocations(
        self,
        actual_invocations,
        expected_invocations,
        conversation_scenario=None,
    ) -> EvaluationResult:
        ...

Registration in _get_default_metric_evaluator_registry():

MetricInfo(
    metric_name="tool_parameter_match",
    description="Argument-level match score for deterministically aligned tool calls.",
),

Usage in an eval set config:

{
  "metric": "tool_parameter_match",
  "criterion": {
    "threshold": 0.8,
    "matchMode": "name_and_required_args",
    "defaultStrategy": "exact",
    "perArgStrategies": { "temperature": "numeric" },
    "numericTolerance": 0.5
  }
}
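
One detail the config example leaves open is how the camelCase keys map onto the snake_case criterion fields. A minimal, self-contained sketch assuming a pydantic v2 model with a camelCase alias generator (whether BaseCriterion already provides this is an assumption; the model name here is a stand-in):

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

# Standalone illustration; the real criterion would extend BaseCriterion.
class _CriterionSketch(BaseModel):
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)

    threshold: float = 0.5
    match_mode: str = "name_and_required_args"
    default_strategy: str = "exact"
    per_arg_strategies: dict[str, str] | None = None
    numeric_tolerance: float = 0.0

criterion = _CriterionSketch.model_validate({
    "threshold": 0.8,
    "matchMode": "name_and_required_args",
    "defaultStrategy": "exact",
    "perArgStrategies": {"temperature": "numeric"},
    "numericTolerance": 0.5,
})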

Additional Context

Filed alongside #5306 (tool_trajectory_f1). The two metrics are independent but address the same workflow gap — partial-credit, deterministic tool-use scoring for regression tracking. They are designed to compose: trajectory scoring decides whether the right tools were called; parameter scoring decides whether each call's arguments were correct.

Issue #4794 proposes adding ignore_args to the existing trajectory evaluator. tool_parameter_match is complementary — it provides positive argument-quality scoring rather than argument filtering.

I'd appreciate early feedback on two API questions:

  1. Is a separate ToolParameterMatchCriterion preferred, or should argument-quality scoring be folded into an existing argument-aware criterion?
  2. Should call-alignment logic be shared with tool_trajectory_in_order / tool_trajectory_f1 via a common helper, or recomputed inside each metric independently?
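
For context on question 2, one possible shape for a shared alignment helper, sketched as a greedy, in-order, one-to-one match on tool name; the ToolCall type and function name are hypothetical stand-ins for ADK's actual representations:

from dataclasses import dataclass, field

@dataclass
class ToolCall:  # hypothetical stand-in for ADK's tool-call representation
    name: str
    args: dict = field(default_factory=dict)

def align_tool_calls(
    expected: list[ToolCall],
    actual: list[ToolCall],
) -> list[tuple[ToolCall, ToolCall | None]]:
    """Greedy in-order alignment: each expected call matches at most one
    later actual call with the same tool name; unmatched expected calls
    are paired with None (and would later be scored 0.0)."""
    pairs = []
    cursor = 0
    for exp in expected:
        match = None
        for i in range(cursor, len(actual)):
            if actual[i].name == exp.name:
                match = actual[i]
                cursor = i + 1
                break
        pairs.append((exp, match))
    return pairs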

Metadata

Labels: eval [Component: related to evaluation], needs review [Status: awaiting review from the maintainer]
