feat(etl): Wave 4D — 12 beta provider normalizers (full codeburn catalog coverage)#79

Merged
0bserver07 merged 1 commit into main from feat/etl-beta-normalizers on May 6, 2026

@0bserver07 (Owner)

Summary

  • 12 beta provider normalizers ship in Wave 4D: Codeium (stub), Continue, Copilot, Cursor Agent, Droid, Gemini, KiloCode, Kiro, OpenClaw, OpenCode, Pi/OMP, Qwen, Roo Code.
  • ETL pipeline now covers all 16 providers from the codeburn catalog. Each normalizer subclasses Normalizer from stackunderflow/etl/normalize/base.py, calls _build_event() for the canonical row shape, stamps cost_source per spec, and preserves provider-specific fields in raw_extras.
  • Beta providers stay opt-in via the existing STACKUNDERFLOW_BETA_* env flags — registering at import time is harmless when those adapters are off because no rows ever land with the matching provider value.

Implementation notes

  • Cached-subtraction rule applied to Gemini + Qwen exactly per catalog: input = promptTokenCount - cachedContentTokenCount, output = candidatesTokenCount + thoughtsTokenCount, cache_read = cachedContentTokenCount, cache_create = 0.
  • Reasoning fold-in applied to OpenCode: output = tokens.output + tokens.reasoning. Droid does the same for its thinkingTokens slot.
  • KiloCode + RooCode subclass ClineNormalizer directly — same on-disk format, only provider_name differs.
  • Codeium is a discovery-only stub that yields zero events; the registry entry exists so lookups never KeyError when the beta flag is on.
  • Pi covers OMP — same parser logic, different on-disk roots (~/.pi/agent/sessions/ vs. ~/.omp/agent/sessions/); we register PiNormalizer under both provider names.
  • Cursor Agent + Kiro always stamp cost_source='estimated' because their sources never carry per-message tokens.
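
The token arithmetic in the notes above can be sanity-checked with a small standalone sketch. This is not the code in this PR: the function names, the `_count` helper, and the output keys (`input`, `output`, `cache_read`, `cache_create`) are illustrative stand-ins that mirror the shorthand used in the notes, and treating a missing field as 0 is an assumption.

```python
from typing import Any, Dict, Mapping


def _count(usage: Mapping[str, Any], key: str) -> int:
    """Read a token count; a missing or null field is treated as 0 (assumption)."""
    return int(usage.get(key) or 0)


def gemini_style_tokens(usage: Mapping[str, Any]) -> Dict[str, int]:
    """Cached-subtraction rule as described for Gemini and Qwen."""
    prompt = _count(usage, "promptTokenCount")
    cached = _count(usage, "cachedContentTokenCount")
    candidates = _count(usage, "candidatesTokenCount")
    thoughts = _count(usage, "thoughtsTokenCount")
    return {
        "input": prompt - cached,          # cached tokens are not billed as input
        "output": candidates + thoughts,   # thinking tokens count as output
        "cache_read": cached,
        "cache_create": 0,
    }


def opencode_output_tokens(tokens: Mapping[str, Any]) -> int:
    """Reasoning fold-in as described for OpenCode."""
    return _count(tokens, "output") + _count(tokens, "reasoning")


sample = {
    "promptTokenCount": 1000,
    "cachedContentTokenCount": 400,
    "candidatesTokenCount": 200,
    "thoughtsTokenCount": 50,
}
print(gemini_style_tokens(sample))
print(opencode_output_tokens({"output": 120, "reasoning": 30}))
```

With the sample above, 1000 prompt tokens of which 400 are cached give 600 input tokens and 400 cache-read tokens, so cached content is not counted twice when the rate card is applied.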

Files touched

  • stackunderflow/etl/normalize/{codeium,continue_,copilot,cursor_agent,droid,gemini,kilocode,kiro,openclaw,opencode,pi,qwen,roocode}.py (13 new)
  • stackunderflow/etl/normalize/__init__.py (appended 14 registrations; PiNormalizer is registered twice, as pi and omp)
  • tests/stackunderflow/etl/normalize/test_<provider>.py (13 new files, 77 new tests)
  • CHANGELOG.md

No routes, marts, watcher, or backfill code touched — strictly per scope.
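
For reviewers who want a mental model of the registration pattern, here is a minimal sketch under stated assumptions. The `Normalizer` and `ClineNormalizer` classes below are simplified stand-ins, not the real ones in stackunderflow/etl/normalize/; `REGISTRY`, `register()`, `normalize()`, the event dict shape, and the provider strings are all hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


class Normalizer:
    """Simplified stand-in for the real base class (its API is not shown here)."""

    provider_name = ""

    def normalize(self, records: Iterable[Dict[str, Any]]) -> Iterator[Event]:
        raise NotImplementedError


class ClineNormalizer(Normalizer):
    provider_name = "cline"

    def normalize(self, records: Iterable[Dict[str, Any]]) -> Iterator[Event]:
        for record in records:
            # Provider-specific fields are kept verbatim alongside the event.
            yield {"provider": self.provider_name, "raw_extras": dict(record)}


class KiloCodeNormalizer(ClineNormalizer):
    provider_name = "kilocode"  # same on-disk format; only the name differs


class RooCodeNormalizer(ClineNormalizer):
    provider_name = "roocode"


class CodeiumNormalizer(Normalizer):
    provider_name = "codeium"

    def normalize(self, records: Iterable[Dict[str, Any]]) -> Iterator[Event]:
        return iter(())  # discovery-only stub: never yields an event


class PiNormalizer(ClineNormalizer):
    """Hypothetical: one parser instantiated once per provider name."""

    def __init__(self, provider_name: str = "pi") -> None:
        self.provider_name = provider_name


REGISTRY: Dict[str, Normalizer] = {}


def register(normalizer: Normalizer) -> None:
    REGISTRY[normalizer.provider_name] = normalizer


for instance in (
    KiloCodeNormalizer(),
    RooCodeNormalizer(),
    CodeiumNormalizer(),
    PiNormalizer("pi"),
    PiNormalizer("omp"),
):
    register(instance)

print(sorted(REGISTRY))
```

In this sketch, registering an unused normalizer costs one dictionary entry, which is consistent with the claim that import-time registration is harmless while a beta adapter is off, and the stub entry means a lookup for its provider name never raises KeyError.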

Test plan

  • pytest tests/ -q — 1551 passed, 2 skipped (was 1474, +77 new tests)
  • ruff check stackunderflow/etl/normalize/ tests/stackunderflow/etl/normalize/ — clean
  • All 4 default-on normalizers (claude/codex/cursor/cline) still pass unchanged
  • Sample event rows verified for every provider — token math matches catalog spec, cost_usd computes correctly via the rate card

🤖 Generated with Claude Code

…log coverage)

Adds Normalizer subclasses for the 12 beta providers Wave 2A left for
later (Codeium, Continue, Copilot, Cursor Agent, Droid, Gemini,
KiloCode, Kiro, OpenClaw, OpenCode, Pi/OMP, Qwen, Roo Code). The ETL
pipeline now covers all 16 providers from the codeburn catalog.

Token semantics match the catalog spec exactly — Gemini and Qwen apply
the cached-subtraction rule (input = promptTokenCount -
cachedContentTokenCount, output = candidatesTokenCount +
thoughtsTokenCount). OpenCode folds reasoning into output. Cursor Agent
and Kiro stamp cost_source='estimated' unconditionally because their
sources don't carry per-message tokens. Codeium is a discovery-only
stub that yields zero events. KiloCode + RooCode subclass Cline since
they share the on-disk format (api_req_started.text JSON blob).

Beta providers stay opt-in via the existing STACKUNDERFLOW_BETA_*
adapter flags — registering normalizers here is harmless when the
matching adapter is off because no rows ever land with that provider
value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bserver07 merged commit c678b2d into main on May 6, 2026
9 checks passed
0bserver07 added a commit that referenced this pull request May 6, 2026
Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from
the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82)
into a single [0.7.0] section.

New: docs/HANDOFF.md — state-of-the-codebase walkthrough for incoming
agents. Architecture map, recent history, key gotchas, what's left,
files-to-read-first.

End-state on the maintainer's real store:
  150,337 usage_events
  Marts populated and watermarks in sync
  Dashboard cold-load 2.5s → <50ms warm
  Watcher 155ms end-to-end source-file-write → dashboard-data-fresh

1598 backend tests passing, 2 skipped, 11 deselected (slow suite).
Frontend typecheck + build clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>