
feat(streaming): emit OTel metrics for ttft, tps, token counts #347

Open

x wants to merge 9 commits into next from x/llm-metrics-instrumentation

Conversation


@x x commented May 7, 2026

Metrics

Eight OTel metrics in agentex.lib.core.observability. Histograms are streaming-only (need per-chunk visibility); counters are provider-agnostic and fire from RunHooks.on_llm_end.

| Metric | Type | Source |
| --- | --- | --- |
| agentex.llm.ttft | histogram (ms) | streaming model: request → first content token |
| agentex.llm.ttat | histogram (ms) | streaming model: request → first answering token (excludes reasoning chunks) |
| agentex.llm.tps | histogram (tok/s) | streaming model: over the generation window (first_token_at → last_token_at) |
| agentex.llm.requests | counter (with status) | LLMMetricsHooks |
| agentex.llm.input_tokens | counter | LLMMetricsHooks |
| agentex.llm.output_tokens | counter | LLMMetricsHooks |
| agentex.llm.cached_input_tokens | counter | LLMMetricsHooks |
| agentex.llm.reasoning_tokens | counter | LLMMetricsHooks |

status values: success / rate_limit / server_error / client_error / timeout / network_error / other_error.
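
For readers new to the module, here is a minimal sketch of how the lazy singleton and instruments might fit together; the instrument names come from the table above, but the exact structure of llm_metrics.py may differ.

```python
# Illustrative sketch only; not the verbatim contents of llm_metrics.py.
from typing import Optional

from opentelemetry import metrics


class LLMMetrics:
    def __init__(self) -> None:
        meter = metrics.get_meter("agentex.llm")
        self.ttft_ms = meter.create_histogram(
            "agentex.llm.ttft", unit="ms",
            description="Request start to first content token",
        )
        self.ttat_ms = meter.create_histogram(
            "agentex.llm.ttat", unit="ms",
            description="Request start to first answering token",
        )
        self.tps = meter.create_histogram(
            "agentex.llm.tps", unit="tok/s",
            description="Output tokens over the generation window",
        )
        self.requests = meter.create_counter("agentex.llm.requests")
        self.input_tokens = meter.create_counter("agentex.llm.input_tokens")
        self.output_tokens = meter.create_counter("agentex.llm.output_tokens")
        self.cached_input_tokens = meter.create_counter("agentex.llm.cached_input_tokens")
        self.reasoning_tokens = meter.create_counter("agentex.llm.reasoning_tokens")


_metrics: Optional[LLMMetrics] = None


def get_llm_metrics() -> LLMMetrics:
    # Lazily created so a MeterProvider configured after import still binds;
    # without a configured provider the instruments are no-ops.
    global _metrics
    if _metrics is None:
        _metrics = LLMMetrics()
    return _metrics
```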

Why both layers

Hooks fire on every Runner.run / Runner.run_streamed call — async-Temporal and sync ACP both pick them up. TemporalStreamingHooks inherits LLMMetricsHooks so existing async users get the counters automatically. Sync ACP opts in: Runner.run_streamed(..., hooks=LLMMetricsHooks()).
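
For the sync opt-in, the call site looks roughly like this; my_agent and user_input are placeholders and agent construction is elided.

```python
# Hypothetical call site; my_agent and user_input are placeholders.
from agents import Runner

from agentex.lib.core.observability.llm_metrics_hooks import LLMMetricsHooks

result = Runner.run_streamed(my_agent, user_input, hooks=LLMMetricsHooks())
# requests and the token counters are emitted from on_llm_end as the run completes.
```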

The streaming model owns ttft/ttat/tps because hooks can't see individual chunks.

Cardinality

model on histograms. model × status on the request counter. No task-id, no per-call labels.

Robustness

on_llm_end reaches into nested optional fields some providers don't populate (litellm/Anthropic). Defensive structure:

  • requests emits outside the usage-extraction try — success counter fires even if the response shape is unusual.
  • Token-counter block has its own try/except. ... or 0 handles None.
  • record_llm_failure (used in error path) has its own try/except — exporter failures never shadow the original LLM exception.

Tests cover: missing .usage, missing input_tokens_details, all-None token values.
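
A condensed sketch of that defensive shape, assuming the Agents SDK Usage field names; this is illustrative, not the verbatim hook body.

```python
# Condensed sketch of the defensive structure; not the verbatim hook body.
async def on_llm_end(self, context, agent, response) -> None:
    m = get_llm_metrics()
    attrs = {"model": str(getattr(agent, "model", "unknown"))}

    # requests fires outside the usage-extraction try, so the success counter
    # is emitted even when the response shape is unusual.
    try:
        m.requests.add(1, {**attrs, "status": "success"})
    except Exception:
        pass

    try:
        usage = response.usage
        m.input_tokens.add(usage.input_tokens or 0, attrs)
        m.output_tokens.add(usage.output_tokens or 0, attrs)
        m.cached_input_tokens.add(usage.input_tokens_details.cached_tokens or 0, attrs)
        m.reasoning_tokens.add(usage.output_tokens_details.reasoning_tokens or 0, attrs)
    except Exception:
        # Some providers (litellm/Anthropic) omit nested usage fields; record
        # what we can and skip the rest rather than crash the caller.
        pass
```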

Tested

  • 20 unit tests in core/observability/tests/. All passing.
  • Local sync ACP agent hitting SGP /v5/ (litellm-equivalent) via OpenAIChatCompletionsModel — 3 messages, no crashes.
  • aimi-scale -k 2 eval pinned to this commit — 2/2 examples, metrics visible in Mimir.

Notes

stream_start_perf is bookmarked before responses.create() so ttft/ttat include network RTT.
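
Concretely, the placement looks like this; client, model, and items are placeholders, and is_content_delta stands in for the real event-type checks.

```python
# Timing sketch only; the real get_response loop handles many more event types.
import time


async def stream_with_ttft(client, model, items):
    stream_start_perf = time.perf_counter()  # before the request, so RTT is included
    stream = await client.responses.create(model=model, input=items, stream=True)

    first_token_at = None
    async for event in stream:
        if first_token_at is None and is_content_delta(event):  # hypothetical helper
            first_token_at = time.perf_counter()
            ttft = (first_token_at - stream_start_perf) * 1000.0
            get_llm_metrics().ttft_ms.record(ttft, {"model": model})
        # ... forward the event to the caller as usual
```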

If upstream OpenAI Agents SDK ever adds metrics hooks on Model, this is a clean swap-out.

Greptile Summary

  • Adds eight OTel metrics (ttft, ttat, tps histograms; requests, input_tokens, output_tokens, cached_input_tokens, reasoning_tokens counters) via a lazy singleton (get_llm_metrics()), wired into LLMMetricsHooks (new RunHooks subclass) and TemporalStreamingModel for streaming-only histograms. TemporalStreamingHooks now inherits LLMMetricsHooks so existing Temporal users pick up counters automatically.
  • stream_start_perf is correctly placed before responses.create() in the updated code, and instruments are lazily created so a MeterProvider configured after import still binds.
  • Several robustness concerns flagged in the prior review remain open: the token-counter block shares a single try/except causing silent cascade drops, the str(agent.model) fallback can produce memory-address cardinality when agent.model is a Model instance, and the streaming-metrics emit block is unwrapped — an OTel exporter exception there would be swallowed by the outer except handler.

Confidence Score: 4/5

Safe to merge with awareness that several P1s from the prior review round are still open; no new blocking issues introduced.

The prior review surfaced multiple P1 findings (single try/except cascade drop, str(agent.model) cardinality, unwrapped OTel emit block). Those conversations remain unresolved in the thread history, holding the score at 4.

The affected files are src/agentex/lib/core/observability/llm_metrics_hooks.py (token counter cascade + cardinality) and src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py (unwrapped OTel emit and record_llm_failure ordering).

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/observability/llm_metrics.py | New file: defines the lazy LLMMetrics singleton with 8 OTel instruments and the classify_status helper. Clean design; lazy init avoids no-op provider binding; bounded cardinality. |
| src/agentex/lib/core/observability/llm_metrics_hooks.py | New RunHooks adapter. Token counters share a single try/except block, so a failure on cached_tokens silently drops reasoning_tokens; str(agent.model) can produce memory-address cardinality when model is a Model instance (both flagged in the prior review). |
| src/agentex/lib/core/observability/tests/test_llm_metrics_hooks.py | Good coverage of normal, zero-token, and failure cases; missing a positive assertion for output_tokens in test_missing_token_details_skips_those_counters. |
| src/agentex/lib/core/temporal/plugins/openai_agents/hooks/hooks.py | TemporalStreamingHooks now inherits LLMMetricsHooks instead of RunHooks directly: a clean change that adds metric emission for existing Temporal streaming users. |
| src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py | Adds ttft/ttat/tps emission after the streaming loop; stream_start_perf is correctly placed before responses.create(). Prior-review issues remain: unwrapped OTel emit block, record_llm_failure ordering. |

Sequence Diagram

sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OTel as OTel (get_llm_metrics)
    participant Hooks as LLMMetricsHooks.on_llm_end

    SDK->>TSM: get_response()
    TSM->>TSM: "stream_start_perf = perf_counter()"
    TSM->>TSM: await responses.create()
    loop Each stream event
        TSM->>TSM: Record first_token_at / last_token_at / first_answer_at
    end
    TSM->>OTel: ttft_ms.record()
    TSM->>OTel: ttat_ms.record()
    TSM->>OTel: tps.record()
    TSM-->>SDK: ModelResponse
    SDK->>Hooks: on_llm_end(agent, response)
    Hooks->>OTel: "requests.add(1, status=success)"
    Hooks->>OTel: input_tokens.add()
    Hooks->>OTel: output_tokens.add()
    Hooks->>OTel: cached_input_tokens.add()
    Hooks->>OTel: reasoning_tokens.add()
    note over TSM,OTel: Error path
    TSM-xTSM: Exception in streaming loop
    TSM->>OTel: record_llm_failure(model, exc)
    TSM-->>SDK: raise typed LLM exception


Prompt To Fix All With AI
Fix the following code review issue, proposing a concise fix.

---

### Issue 1 of 1
src/agentex/lib/core/observability/tests/test_llm_metrics_hooks.py:165-170
The test documents that `input_tokens` is emitted before the failing `input_tokens_details` access, but doesn't assert that `output_tokens` is also emitted. Since `m.output_tokens.add(...)` runs at line 45 of `llm_metrics_hooks.py` — before the AttributeError on `input_tokens_details` — it should be emitted here too. Without the assertion this behaviour is unverified, and a future reordering of those lines could silently break `output_tokens` reporting.

```suggestion
        # input_tokens.add and output_tokens.add fire before the nested attribute access.
        m.input_tokens.add.assert_called_once_with(100, {"model": "gpt-5"})
        m.output_tokens.add.assert_called_once_with(50, {"model": "gpt-5"})
        # cached_input_tokens / reasoning_tokens skipped — the AttributeError
        # bailed before they could be called.
        m.cached_input_tokens.add.assert_not_called()
        m.reasoning_tokens.add.assert_not_called()
```

Reviews (8): Last reviewed commit: "fix: add @override on LLMMetricsHooks.on..."

…counts

Adds six metrics to TemporalStreamingModel.get_response so applications
that configure an OTel MeterProvider can see streaming-call behavior
without per-app instrumentation:

- agentex.llm.ttft (histogram, ms): time from request start to first
  content delta. Captured on the first ResponseTextDeltaEvent /
  ResponseReasoningTextDeltaEvent / ResponseReasoningSummaryTextDeltaEvent.
- agentex.llm.tps (histogram, tokens/s): output_tokens / stream_duration.
  Use 1/tps for time-per-output-token (tpot).
- agentex.llm.input_tokens / output_tokens / cached_input_tokens /
  reasoning_tokens (counters): pulled from the captured ResponsesAPI
  Usage at end-of-stream. Cache hit rate is computed at query time as
  rate(cached_input_tokens) / rate(input_tokens).

Why
- The data was already captured (line 854 captured_usage = response.usage)
  but never emitted as metrics. Apps could only see total LLM call
  duration, not the meaningful breakdowns.
- Doing this in the SDK rather than each app means every consumer of
  TemporalStreamingModel gets the metrics for free.
- Cardinality is bounded — only `model` is a metric attribute. Resource
  attributes (service.name, k8s.*, etc.) come from the application's
  configured OTel resource, so cross-app comparisons work cleanly in
  Mimir/Prometheus.

The meter is a no-op when no MeterProvider is configured, so this is
safe for apps that don't run with OTel.
x added 2 commits May 7, 2026 12:40
Three changes from PR #347 review:

1. Move stream_start_perf above responses.create() so ttft captures the
   full request-to-first-token latency (HTTP round-trip + model TTFB),
   not just post-connect event-loop delay. Drop the duplicate assignment
   that was being overwritten after the await.

2. Track last_token_at alongside first_token_at and use the generation
   window (first→last delta) as the tps denominator. Total stream
   duration was inflated by tool-call argument deltas, reasoning events,
   and stream_update awaits — making tps under-report the model's
   actual generation speed for agentic responses.

3. Lazy-create the OTel instruments inside _StreamingMetrics rather
   than at import time, so a MeterProvider configured *after* this
   module is imported still binds correctly. Apps that bootstrap OTel
   in a startup hook (common with lazy-init patterns) would otherwise
   silently send to the no-op provider.
- Bookmark first/last token timestamps on ResponseFunctionCallArgumentsDeltaEvent
  too so the tps generation window covers all event types whose tokens land in
  usage.output_tokens. Previously the numerator counted argument tokens but
  the denominator excluded their generation time, inflating tps for tool-heavy
  responses.

- Lifted the bookmarking out of the text-delta branch into a single up-front
  check covering all four token-producing event types — cleaner than
  duplicating across branches.

- Documented the single-token skip case (window collapses to 0) inline at the
  guard. TPS is undefined for a one-token response, so emitting nothing is
  correct; the comment makes the intent visible to future readers (the guard
  is shown in the sketch below).
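
Putting the two messages above together, the end-of-stream emission is roughly the following; captured_usage is the Responses API Usage captured at end-of-stream, and the structure is illustrative rather than verbatim.

```python
# Illustrative end-of-stream emission; not the verbatim model code.
if first_token_at is not None and last_token_at is not None and captured_usage:
    window_s = last_token_at - first_token_at
    # A single-token response collapses the window to 0: tps is undefined
    # there, so nothing is emitted.
    if window_s > 0 and captured_usage.output_tokens:
        get_llm_metrics().tps.record(
            captured_usage.output_tokens / window_s, {"model": model}
        )
```
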
…t counter

Two follow-up changes from the PR review:

1. Move the LLM metric instruments from _StreamingMetrics in
   temporal_streaming_model.py to a new module:
     agentex.lib.core.observability.llm_metrics

   Public API: get_llm_metrics() returns a singleton LLMMetrics with
   the same six instruments (ttft, tps, input_tokens, output_tokens,
   cached_input_tokens, reasoning_tokens) plus a new requests counter.

   This makes the temporal+openai_agents plugin one of several future
   call sites — the sync ACP path and the Claude SDK plugin can
   record to the same instruments without redefining names, units,
   or descriptions. Keeps cross-provider naming consistent.

2. Add agentex.llm.requests counter with a status label so 429s,
   5xxs, timeouts, and other failures are observable on the SDK
   side without scraping logs. classify_status() maps exception
   types to a small fixed set (success / rate_limit / server_error
   / client_error / timeout / network_error / other_error) by class
   name, so it works across OpenAI, Anthropic, and other provider
   SDKs that use similar exception naming.

   Recorded in two places: success path (alongside token counters)
   and the existing get_response except handler (so terminal
   failures emit a counter event before re-raising).

Cardinality remains bounded — model + status (7 values) on the
counter; all other metrics keep just `model`.
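
A sketch of the kind of class-name mapping classify_status performs; the real matching rules may differ.

```python
# Rough sketch; the actual classify_status may match on different substrings.
from typing import Optional


def classify_status(exc: Optional[BaseException]) -> str:
    if exc is None:
        return "success"
    name = type(exc).__name__.lower()
    if "ratelimit" in name:
        return "rate_limit"
    if "timeout" in name:
        return "timeout"
    if "connection" in name or "network" in name:
        return "network_error"
    if "internalserver" in name or "serviceunavailable" in name:
        return "server_error"
    if "badrequest" in name or "authentication" in name or "permission" in name:
        return "client_error"
    return "other_error"
```
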
x added 3 commits May 8, 2026 00:24
ttft fires on the first content delta of any kind, which for reasoning
models means the first reasoning chunk, so it arrives quickly even when the
user-perceived latency is much longer. ttat fires only on the first
user-visible answer token (text delta or tool-call arguments delta),
excluding reasoning chunks. For non-reasoning models the two are equal;
for gpt-5-class / o-series models they differ by the reasoning duration.

This pairs with ttft for "did the model start thinking quickly?" vs
"how long did the user wait for an answer?" — both are valuable signals
that mean different things on reasoning workloads.

Implementation: a third bookmark variable (``first_answer_at``) set
inside the same up-front event-type check, restricted to
ResponseTextDeltaEvent / ResponseFunctionCallArgumentsDeltaEvent.
Adds one new histogram (``agentex.llm.ttat``) — same labels and units
as ttft.
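
A sketch of that up-front check; the event classes are the ones named above (import path assumed to be openai.types.responses), and the surrounding loop and state handling are condensed.

```python
# Sketch of the up-front bookmarking check; loop and state handling condensed.
from openai.types.responses import (
    ResponseFunctionCallArgumentsDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseTextDeltaEvent,
)

TOKEN_EVENTS = (
    ResponseTextDeltaEvent,
    ResponseReasoningTextDeltaEvent,
    ResponseReasoningSummaryTextDeltaEvent,
    ResponseFunctionCallArgumentsDeltaEvent,
)
ANSWER_EVENTS = (ResponseTextDeltaEvent, ResponseFunctionCallArgumentsDeltaEvent)


def bookmark(event, now, state):
    """Update first_token_at / last_token_at / first_answer_at for one event."""
    if isinstance(event, TOKEN_EVENTS):
        state["first_token_at"] = state["first_token_at"] or now
        state["last_token_at"] = now
        if state["first_answer_at"] is None and isinstance(event, ANSWER_EVENTS):
            state["first_answer_at"] = now
```
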
If get_llm_metrics().requests.add() raises (misbehaving exporter, OTel
SDK bug, network blip mid-export), the original LLM exception would be
shadowed by the metric error. Callers — retry logic, circuit breakers,
the OpenAI Agents Temporal plugin's retryable/non-retryable
classifier — inspect the typed exception (RateLimitError,
APITimeoutError, etc.) and would silently break with an unexpected
OTel exception in its place.

Wrap the .add() call in a bare try/except so the metric is best-effort
and the typed LLM exception always propagates.
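
Sketched, the resulting helper is roughly:

```python
# Best-effort failure recording; a misbehaving exporter must never shadow the
# typed LLM exception the caller is about to re-raise. Sketch, not verbatim.
def record_llm_failure(model: str, exc: BaseException) -> None:
    try:
        get_llm_metrics().requests.add(
            1, {"model": model, "status": classify_status(exc)}
        )
    except Exception:
        pass
```
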
Splits metric emission by what each layer can see:
- agentex.lib.core.observability.llm_metrics_hooks.LLMMetricsHooks
  (RunHooks subclass) emits agentex.llm.requests + the four token
  counters in on_llm_end. Works for any RunHooks-aware path.
- TemporalStreamingHooks now inherits from LLMMetricsHooks so the
  async path picks up the same metrics automatically.
- TemporalStreamingModel keeps only the streaming-only metrics
  (ttft, ttat, tps) — those need per-chunk visibility hooks can't
  provide. Failure path uses the new record_llm_failure helper.

This makes adding the sync ACP path trivial later: pass
LLMMetricsHooks() to Runner.run from services/adk/providers/openai.py
and it'll emit the same metrics with no double-counting.

Tests cover:
- classify_status branches (rate_limit / timeout / server_error /
  network_error / client_error / other_error / success)
- get_llm_metrics singleton + instrument presence
- LLMMetricsHooks.on_llm_end emits requests + token counters with
  the right model attribute
- Both the hooks path and record_llm_failure swallow exporter
  exceptions so callers don't break when metrics fail
x added 2 commits May 8, 2026 01:41
…roviders)

If a provider returns a ModelResponse with a Usage shape the OpenAI Agents
SDK didn't fully normalize — missing input_tokens_details, missing usage
entirely, None token values — we want to record what we can and skip the
rest, never crash the caller.

- Move requests.add outside the usage-extraction try block so the success
  counter still fires when usage access raises (e.g., None).
- Add three tests covering: response with raising .usage property,
  Usage missing input_tokens_details, and Usage with all-None token values.
@x x changed the base branch from main to next May 8, 2026 06:49