
feat(streaming): emit OTel metrics for ttft, tps, token counts #347

Open

x wants to merge 9 commits into next from x/llm-metrics-instrumentation

Conversation


@x x commented May 7, 2026

Metrics

Eight OTel metrics in agentex.lib.core.observability. Histograms are streaming-only (need per-chunk visibility); counters are provider-agnostic and fire from RunHooks.on_llm_end.

| Metric | Type | Source |
| --- | --- | --- |
| agentex.llm.ttft | histogram (ms) | streaming model: request → first content token |
| agentex.llm.ttat | histogram (ms) | streaming model: request → first answering token (excludes reasoning chunks) |
| agentex.llm.tps | histogram (tok/s) | streaming model: over the generation window (first_token_at → last_token_at) |
| agentex.llm.requests | counter (with status) | LLMMetricsHooks |
| agentex.llm.input_tokens | counter | LLMMetricsHooks |
| agentex.llm.output_tokens | counter | LLMMetricsHooks |
| agentex.llm.cached_input_tokens | counter | LLMMetricsHooks |
| agentex.llm.reasoning_tokens | counter | LLMMetricsHooks |

status values: success / rate_limit / server_error / client_error / timeout / network_error / other_error.
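
For readers new to the module, here is a minimal sketch of how the lazy singleton and instruments might fit together; the instrument names come from the table above, but the exact structure of llm_metrics.py may differ.

```python
# Illustrative sketch only; not the verbatim contents of llm_metrics.py.
from typing import Optional

from opentelemetry import metrics


class LLMMetrics:
    def __init__(self) -> None:
        meter = metrics.get_meter("agentex.llm")
        self.ttft_ms = meter.create_histogram(
            "agentex.llm.ttft", unit="ms",
            description="Request start to first content token",
        )
        self.ttat_ms = meter.create_histogram(
            "agentex.llm.ttat", unit="ms",
            description="Request start to first answering token",
        )
        self.tps = meter.create_histogram(
            "agentex.llm.tps", unit="tok/s",
            description="Output tokens over the generation window",
        )
        self.requests = meter.create_counter("agentex.llm.requests")
        self.input_tokens = meter.create_counter("agentex.llm.input_tokens")
        self.output_tokens = meter.create_counter("agentex.llm.output_tokens")
        self.cached_input_tokens = meter.create_counter("agentex.llm.cached_input_tokens")
        self.reasoning_tokens = meter.create_counter("agentex.llm.reasoning_tokens")


_metrics: Optional[LLMMetrics] = None


def get_llm_metrics() -> LLMMetrics:
    # Lazily created so a MeterProvider configured after import still binds;
    # without a configured provider the instruments are no-ops.
    global _metrics
    if _metrics is None:
        _metrics = LLMMetrics()
    return _metrics
```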

Why both layers

Hooks fire on every Runner.run / Runner.run_streamed call — async-Temporal and sync ACP both pick them up. TemporalStreamingHooks inherits LLMMetricsHooks so existing async users get the counters automatically. Sync ACP opts in: Runner.run_streamed(..., hooks=LLMMetricsHooks()).
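
For the sync opt-in, the call site looks roughly like this; my_agent and user_input are placeholders and agent construction is elided.

```python
# Hypothetical call site; my_agent and user_input are placeholders.
from agents import Runner

from agentex.lib.core.observability.llm_metrics_hooks import LLMMetricsHooks

result = Runner.run_streamed(my_agent, user_input, hooks=LLMMetricsHooks())
# requests and the token counters are emitted from on_llm_end as the run completes.
```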

The streaming model owns ttft/ttat/tps because hooks can't see individual chunks.

Cardinality

model on histograms. model × status on the request counter. No task-id, no per-call labels.

Robustness

on_llm_end reaches into nested optional fields some providers don't populate (litellm/Anthropic). Defensive structure:

  • requests emits outside the usage-extraction try — success counter fires even if the response shape is unusual.
  • Token-counter block has its own try/except. ... or 0 handles None.
  • record_llm_failure (used in error path) has its own try/except — exporter failures never shadow the original LLM exception.

Tests cover: missing .usage, missing input_tokens_details, all-None token values.
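
A condensed sketch of that defensive shape, assuming the Agents SDK Usage field names; this is illustrative, not the verbatim hook body.

```python
# Condensed sketch of the defensive structure; not the verbatim hook body.
async def on_llm_end(self, context, agent, response) -> None:
    m = get_llm_metrics()
    attrs = {"model": str(getattr(agent, "model", "unknown"))}

    # requests fires outside the usage-extraction try, so the success counter
    # is emitted even when the response shape is unusual.
    try:
        m.requests.add(1, {**attrs, "status": "success"})
    except Exception:
        pass

    try:
        usage = response.usage
        m.input_tokens.add(usage.input_tokens or 0, attrs)
        m.output_tokens.add(usage.output_tokens or 0, attrs)
        m.cached_input_tokens.add(usage.input_tokens_details.cached_tokens or 0, attrs)
        m.reasoning_tokens.add(usage.output_tokens_details.reasoning_tokens or 0, attrs)
    except Exception:
        # Some providers (litellm/Anthropic) omit nested usage fields; record
        # what we can and skip the rest rather than crash the caller.
        pass
```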

Tested

  • 20 unit tests in core/observability/tests/. All passing.
  • Local sync ACP agent hitting SGP /v5/ (litellm-equivalent) via OpenAIChatCompletionsModel — 3 messages, no crashes.
  • aimi-scale -k 2 eval pinned to this commit — 2/2 examples, metrics visible in Mimir.

Notes

stream_start_perf is bookmarked before responses.create() so ttft/ttat include network RTT.
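
Concretely, the placement looks like this; client, model, and items are placeholders, and is_content_delta stands in for the real event-type checks.

```python
# Timing sketch only; the real get_response loop handles many more event types.
import time


async def stream_with_ttft(client, model, items):
    stream_start_perf = time.perf_counter()  # before the request, so RTT is included
    stream = await client.responses.create(model=model, input=items, stream=True)

    first_token_at = None
    async for event in stream:
        if first_token_at is None and is_content_delta(event):  # hypothetical helper
            first_token_at = time.perf_counter()
            ttft = (first_token_at - stream_start_perf) * 1000.0
            get_llm_metrics().ttft_ms.record(ttft, {"model": model})
        # ... forward the event to the caller as usual
```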

If upstream OpenAI Agents SDK ever adds metrics hooks on Model, this is a clean swap-out.

Greptile Summary

  • Adds eight OTel metrics (ttft, ttat, tps histograms; requests, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens counters) via a lazy singleton (get_llm_metrics()), wired into LLMMetricsHooks (new RunHooks subclass) and TemporalStreamingModel for streaming-only histograms. TemporalStreamingHooks now inherits LLMMetricsHooks so existing Temporal users pick up counters automatically.
  • stream_start_perf is correctly placed before responses.create() in the updated code, and instruments are lazily created so a MeterProvider configured after import still binds.
  • Several robustness concerns flagged in the prior review remain open: the token-counter block shares a single try/except causing silent cascade drops, the str(agent.model) fallback can produce memory-address cardinality when agent.model is a Model instance, and the streaming-metrics emit block is unwrapped — an OTel exporter exception there would be swallowed by the outer except handler.

Confidence Score: 4/5

Safe to merge with awareness that several P1s from the prior review round are still open; no new blocking issues introduced.

The prior review surfaced multiple P1 findings (single try/except cascade drop, str(agent.model) cardinality, unwrapped OTel emit block). Those conversations remain unresolved in the thread history, holding the score at 4.

The affected files are src/agentex/lib/core/observability/llm_metrics_hooks.py (token counter cascade + cardinality) and src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py (unwrapped OTel emit and record_llm_failure ordering).

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/llm_metrics.py | New file: defines the lazy LLMMetrics singleton with 8 OTel instruments and the classify_status helper. Clean design; lazy init avoids no-op provider binding; bounded cardinality. |
| src/agentex/lib/core/observability/llm_metrics_hooks.py | New RunHooks adapter. Token counters share a single try/except block, so a failure on cached_tokens silently drops reasoning_tokens; str(agent.model) can produce memory-address cardinality when model is a Model instance (both flagged in the prior review). |
| src/agentex/lib/core/observability/tests/test_llm_metrics_hooks.py | Good coverage of normal, zero-token, and failure cases; missing a positive assertion for output_tokens in test_missing_token_details_skips_those_counters. |
| src/agentex/lib/core/temporal/plugins/openai_agents/hooks/hooks.py | TemporalStreamingHooks now inherits LLMMetricsHooks instead of RunHooks directly: a clean change that adds metric emission for existing Temporal streaming users. |
| src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py | Adds ttft/ttat/tps emission after the streaming loop; stream_start_perf is correctly placed before responses.create(). Prior-review issues remain: unwrapped OTel emit block, record_llm_failure ordering. |

Sequence Diagram

sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OTel as OTel (get_llm_metrics)
    participant Hooks as LLMMetricsHooks.on_llm_end

    SDK->>TSM: get_response()
    TSM->>TSM: "stream_start_perf = perf_counter()"
    TSM->>TSM: await responses.create()
    loop Each stream event
        TSM->>TSM: Record first_token_at / last_token_at / first_answer_at
    end
    TSM->>OTel: ttft_ms.record()
    TSM->>OTel: ttat_ms.record()
    TSM->>OTel: tps.record()
    TSM-->>SDK: ModelResponse
    SDK->>Hooks: on_llm_end(agent, response)
    Hooks->>OTel: "requests.add(1, status=success)"
    Hooks->>OTel: input_tokens.add()
    Hooks->>OTel: output_tokens.add()
    Hooks->>OTel: cached_input_tokens.add()
    Hooks->>OTel: reasoning_tokens.add()
    note over TSM,OTel: Error path
    TSM-xTSM: Exception in streaming loop
    TSM->>OTel: record_llm_failure(model, exc)
    TSM-->>SDK: raise typed LLM exception


Prompt To Fix All With AI
Fix the following code review issue, proposing a concise fix.

---

### Issue 1 of 1
src/agentex/lib/core/observability/tests/test_llm_metrics_hooks.py:165-170
The test documents that `input_tokens` is emitted before the failing `input_tokens_details` access, but doesn't assert that `output_tokens` is also emitted. Since `m.output_tokens.add(...)` runs at line 45 of `llm_metrics_hooks.py` — before the AttributeError on `input_tokens_details` — it should be emitted here too. Without the assertion this behaviour is unverified, and a future reordering of those lines could silently break `output_tokens` reporting.

```suggestion
        # input_tokens.add and output_tokens.add fire before the nested attribute access.
        m.input_tokens.add.assert_called_once_with(100, {"model": "gpt-5"})
        m.output_tokens.add.assert_called_once_with(50, {"model": "gpt-5"})
        # cached_input_tokens / reasoning_tokens skipped — the AttributeError
        # bailed before they could be called.
        m.cached_input_tokens.add.assert_not_called()
        m.reasoning_tokens.add.assert_not_called()
```

Reviews (8): Last reviewed commit: "fix: add @override on LLMMetricsHooks.on..."

…counts

Adds six metrics to TemporalStreamingModel.get_response so applications
that configure an OTel MeterProvider can see streaming-call behavior
without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first
  content delta. Captured on the first ResponseTextDeltaEvent /
  ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration.
  Use 1/tps for time-per-output-token (tpot).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens /
  reasoning_tokens (counters): pulled from the captured ResponsesAPI
  Usage at end-of-stream. Cache hit rate is computed at query time as
  rate(cached_input_tokens) / rate(input_tokens).

Why
- The data was already captured (line 854 captured_usage = response.usage)
  but never emitted as metrics. Apps could only see total LLM call
  duration, not the meaningful breakdowns.
- Doing this in the SDK rather than each app means every consumer of
  TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource
  attributes (service.name, k8s.*, etc.) come from the application's
  configured OTel resource, so cross-app comparisons work cleanly in
  Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is
safe for apps that don't run with OTel.
x added 2 commits May 7, 2026 12:40
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the
   full request-to-first-token latency (HTTP round-trip + model TTFB),
   not just post-connect event-loop delay. Drop the duplicate assignment
   that was being overwritten after the await.

2. Track last_token_at alongside first_token_at and use the generation
   window (first→last delta) as the tps denominator. Total stream
   duration was inflated by tool-call argument deltas, reasoning events,
   and stream_update awaits — making tps under-report the model's
   actual generation speed for agentic responses.

3. Lazy-create the OTel instruments inside _StreamingMetrics rather
   than at import time, so a MeterProvider configured *after* this
   module is imported still binds correctly. Apps that bootstrap OTel
   in a startup hook (common with lazy-init patterns) would otherwise
   silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent
  too so the tps generation window covers all event types whose tokens land in
  usage.output_tokens. Previously the numerator counted argument tokens but
  the denominator excluded their generation time, inflating tps for tool-heavy
  responses.

- Lifted the bookmarking out of the text-delta branch into a single up-front
  check covering all four token-producing event types — cleaner than
  duplicating across branches.

- Documented the single-token skip case (window collapses to 0) inline at the
  guard. TPS is undefined for a one-token response, so emitting nothing is
  correct; the comment makes the intent visible to future readers (the guard
  is shown in the sketch below).
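
Putting the two messages above together, the end-of-stream emission is roughly the following; captured_usage is the Responses API Usage captured at end-of-stream, and the structure is illustrative rather than verbatim.

```python
# Illustrative end-of-stream emission; not the verbatim model code.
if first_token_at is not None and last_token_at is not None and captured_usage:
    window_s = last_token_at - first_token_at
    # A single-token response collapses the window to 0: tps is undefined
    # there, so nothing is emitted.
    if window_s > 0 and captured_usage.output_tokens:
        get_llm_metrics().tps.record(
            captured_usage.output_tokens / window_s, {"model": model}
        )
```
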
…t counter

Two follow-up changes from the PR review:

1. Move the LLM metric instruments from _StreamingMetrics in
   temporal_streaming_model.py to a new module:
     agentex.lib.core.observability.llm_metrics

   Public API: get_llm_metrics() returns a singleton LLMMetrics with
   the same six instruments (ttft, tps, input_tokens, output_tokens,
   cached_input_tokens, reasoning_tokens) plus a new requests counter.

   This makes the temporal+openai_agents plugin one of several future
   call sites — the sync ACP path and the Claude SDK plugin can
   record to the same instruments without redefining names, units,
   or descriptions. Keeps cross-provider naming consistent.

2. Add agentex.llm.requests counter with a status label so 429s,
   5xxs, timeouts, and other failures are observable on the SDK
   side without scraping logs. classify_status() maps exception
   types to a small fixed set (success / rate_limit / server_error
   / client_error / timeout / network_error / other_error) by class
   name, so it works across OpenAI, Anthropic, and other provider
   SDKs that use similar exception naming.

   Recorded in two places: success path (alongside token counters)
   and the existing get_response except handler (so terminal
   failures emit a counter event before re-raising).

Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
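
A sketch of the kind of class-name mapping classify_status performs; the real matching rules may differ.

```python
# Rough sketch; the actual classify_status may match on different substrings.
from typing import Optional


def classify_status(exc: Optional[BaseException]) -> str:
    if exc is None:
        return "success"
    name = type(exc).__name__.lower()
    if "ratelimit" in name:
        return "rate_limit"
    if "timeout" in name:
        return "timeout"
    if "connection" in name or "network" in name:
        return "network_error"
    if "internalserver" in name or "serviceunavailable" in name:
        return "server_error"
    if "badrequest" in name or "authentication" in name or "permission" in name:
        return "client_error"
    return "other_error"
```
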
x added 3 commits May 8, 2026 00:24
ttft fires on the first content delta of any kind, which for reasoning
models means the first reasoning chunk, so it arrives quickly even when the
user-perceived latency is much longer. ttat fires only on the first
user-visible answer token (text delta or tool-call arguments delta),
excluding reasoning chunks. For non-reasoning models the two are equal;
for gpt-5-class / o-series models they differ by the reasoning duration.

This pairs with ttft for "did the model start thinking quickly?" vs
"how long did the user wait for an answer?" — both are valuable signals
that mean different things on reasoning workloads.

Implementation: a third bookmark variable (``first_answer_at``) set
inside the same up-front event-type check, restricted to
ResponseTextDeltaEvent / ResponseFunctionCallArgumentsDeltaEvent.
Adds one new histogram (``agentex.llm.ttat``) — same labels and units
as ttft.
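
A sketch of that up-front check; the event classes are the ones named above (import path assumed to be openai.types.responses), and the surrounding loop and state handling are condensed.

```python
# Sketch of the up-front bookmarking check; loop and state handling condensed.
from openai.types.responses import (
    ResponseFunctionCallArgumentsDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseTextDeltaEvent,
)

TOKEN_EVENTS = (
    ResponseTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseFunctionCallArgumentsDeltaEvent,
)
ANSWER_EVENTS = (ResponseTextDeltaEvent, ResponseFunctionCallArgumentsDeltaEvent)


def bookmark(event, now, state):
    """Update first_token_at / last_token_at / first_answer_at for one event."""
    if isinstance(event, TOKEN_EVENTS):
        state["first_token_at"] = state["first_token_at"] or now
        state["last_token_at"] = now
        if state["first_answer_at"] is None and isinstance(event, ANSWER_EVENTS):
            state["first_answer_at"] = now
```
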
If get_llm_metrics().requests.add() raises (misbehaving exporter, OTel
SDK bug, network blip mid-export), the original LLM exception would be
shadowed by the metric error. Callers — retry logic, circuit breakers,
the OpenAI Agents Temporal plugin's retryable/non-retryable
classifier — inspect the typed exception (RateLimitError,
APITimeoutError, etc.) and would silently break with an unexpected
OTel exception in its place.

Wrap the .add() call in a bare try/except so the metric is best-effort
and the typed LLM exception always propagates.
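
Sketched, the resulting helper is roughly:

```python
# Best-effort failure recording; a misbehaving exporter must never shadow the
# typed LLM exception the caller is about to re-raise. Sketch, not verbatim.
def record_llm_failure(model: str, exc: BaseException) -> None:
    try:
        get_llm_metrics().requests.add(
            1, {"model": model, "status": classify_status(exc)}
        )
    except Exception:
        pass
```
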
Splits metric emission by what each layer can see:
- agentex.lib.core.observability.llm_metrics_hooks.LLMMetricsHooks
  (RunHooks subclass) emits agentex.llm.requests + the four token
  counters in on_llm_end. Works for any RunHooks-aware path.
- TemporalStreamingHooks now inherits from LLMMetricsHooks so the
  async path picks up the same metrics automatically.
- TemporalStreamingModel keeps only the streaming-only metrics
  (ttft, ttat, tps) — those need per-chunk visibility hooks can't
  provide. Failure path uses the new record_llm_failure helper.

This makes adding the sync ACP path trivial later: pass
LLMMetricsHooks() to Runner.run from services/adk/providers/openai.py
and it'll emit the same metrics with no double-counting.

Tests cover:
- classify_status branches (rate_limit / timeout / server_error /
  network_error / client_error / other_error / success)
- get_llm_metrics singleton + instrument presence
- LLMMetricsHooks.on_llm_end emits requests + token counters with
  the right model attribute
- Both the hooks path and record_llm_failure swallow exporter
  exceptions so callers don't break when metrics fail
x added 2 commits May 8, 2026 01:41
…roviders)

If a provider returns a ModelResponse with a Usage shape the OpenAI Agents
SDK didn't fully normalize — missing input_tokens_details, missing usage
entirely, None token values — we want to record what we can and skip the
rest, never crash the caller.

- Move requests.add outside the usage-extraction try block so the success
  counter still fires when usage access raises (e.g., None).
- Add three tests covering: response with raising .usage property,
  Usage missing input_tokens_details, and Usage with all-None token values.
@x x changed the base branch from main to next May 8, 2026 06:49