Skip to content

feat: refactor usage schema for richer event dimensions#5

Open
lohanidamodar wants to merge 18 commits into
mainfrom
feat/CLO-4383-usage-refactor
Open

feat: refactor usage schema for richer event dimensions#5
lohanidamodar wants to merge 18 commits into
mainfrom
feat/CLO-4383-usage-refactor

Conversation

@lohanidamodar
Copy link
Copy Markdown
Contributor

Summary

  • Drop the userAgent raw column and the tags JSON column on both tables.
  • Add 21 dedicated dimension columns to events (service, resource type vs id vs internal-id, team id + internal-id, region, hostname, and 12 UA-parsed fields per the Session model).
  • Add team + resource id columns to gauges.
  • Widen the daily SummingMergeTree MV with resource and team dimensions for billing breakdowns.
  • Strict tag-key validation: unknown keys throw at write time.
  • Country/region values are lowercased; empty strings coerced to null.

Test plan

  • vendor/bin/phpunit
  • composer check
  • Round-trip integration tests for events and gauges hit the new columns.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 25, 2026

Greptile Summary

This PR replaces the freeform userAgent and tags columns with 21 dedicated dimension columns on the events table, 4 on gauges, and 5 on the daily SummingMergeTree — enabling strict key validation at write time, country/region lowercasing, and a new groupBy query operator for multi-dimensional aggregation.

  • Metric::extractColumns() centralises tag normalisation (coercion, lowercasing, unknown-key rejection) and is shared by both adapters, eliminating the previous copy-paste.
  • The daily MV and table now carry resource, resourceId, resourceInternalId, teamId, and teamInternalId for billing breakdowns, but findDaily() still hard-codes its SELECT to only metric/value/time, so those columns are never returned to callers.
  • The committed .phpunit.result.cache records testValidateRejectsNonArrayTags as a defect — the test no longer exists, indicating the cache was captured mid-refactor rather than after a clean run.

Confidence Score: 4/5

Safe to merge with one fix: findDaily() must include the new daily-table dimension columns in its SELECT list before the billing-breakdown feature will work end-to-end.

The core refactor (schema, extractColumns, groupBy) is well-implemented and thoroughly tested. The one concrete defect is that findDaily() hard-codes ['metric', 'value', 'time'] in its SELECT, so the five new dimension columns written by the materialized view (resource, resourceId, resourceInternalId, teamId, teamInternalId) are silently dropped on read — directly breaking the billing-breakdown goal the PR states for those columns.

src/Usage/Adapter/ClickHouse.php — specifically the findDaily() method's hard-coded column list at line 1965.

Important Files Changed

Filename Overview
src/Usage/Adapter/ClickHouse.php Adds 5 new dimension columns to the daily table schema and MV, and introduces groupBy support with correct validation; but findDaily() hard-codes its SELECT to only metric/value/time, so the new billing-breakdown columns are never returned to callers.
src/Usage/Metric.php Adds EVENT_COLUMNS/GAUGE_COLUMNS constants, 18 new typed getters, shared extractColumns() static helper, and updated schema/index definitions; well-structured refactor with no logic issues.
src/Usage/Adapter/Database.php Delegates to shared Metric::extractColumns(), adds validateGroupByQueries() for the groupBy/groupByInterval pairing rule, and makes host/port configurable via env vars; no issues found.
src/Usage/UsageQuery.php Adds TYPE_GROUP_BY constant and groupBy()/isGroupBy()/extractGroupBy()/removeGroupBy() static helpers following the same pattern as existing groupByInterval methods.
tests/Usage/Adapter/ClickHouseTest.php Comprehensive new tests covering full-dimension round-trips, lowercasing, empty-string-to-null coercion, unknown key rejection, gauge column round-trips, and multi-dimensional groupBy.
tests/Usage/MetricTest.php Updated unit tests for the new schema, constants, and getters; the committed .phpunit.result.cache records testValidateRejectsNonArrayTags as a defect (test no longer exists in the file), indicating the cache was captured mid-refactor rather than after a clean run.
.phpunit.result.cache Cache was committed with testValidateRejectsNonArrayTags listed as a defect — this test no longer exists in the codebase, so the entry is stale, but it indicates the last test run was not fully clean when the cache was captured.

Reviews (4): Last reviewed commit: "feat(database-adapter): validate groupBy..." | Re-trigger Greptile

Comment thread src/Usage/Metric.php Outdated
Comment thread src/Usage/Adapter/ClickHouse.php Outdated
…apters

- region column was sized at 2 (copied from country); region codes like
  'us-east' and 'eu-central-1' are routinely longer. Bumped to 64.
- Extracted the tag→column extraction, normalisation, and strict
  unknown-key check into Metric::extractColumns() so ClickHouse and
  Database adapters share one implementation.
…imit

- hostname column 1024 -> 255. DNS hostnames are capped at 253 chars, so
  the wider size never gained anything but pushed the bloom_filter index
  past the InnoDB/utopia-php-database 768-byte single-attribute index
  prefix limit.
- path column stays at 1024 for data fidelity (URLs with signed query
  strings exceed 255), but its index entry now uses an explicit
  lengths=[255] so MariaDB only indexes the first 255 chars. The index
  validator in utopia/database treats lengths[i] as the indexLength when
  set, taking the column under the 768 cap while preserving the full
  column for stored data.

Caught by CI when the full Database adapter suite ran against MariaDB —
local runs skipped 45 DatabaseTest cases because pdo_mysql wasn't loaded.
Verified locally now via docker-compose: 195/195 passing.
Library-private extension that lives on UsageQuery (not the base
utopia-php/query Query class) so cloud can express "group by service over
1d", "group by path x status over 1h" requests without forcing an SDK
regeneration across every Appwrite client.

- TYPE_GROUP_BY constant + static factory groupBy(attribute).
- isMethod() accepts the new method so Query::parse() round-trips it
  from the JSON wire form.
- isGroupBy / extractGroupBy / removeGroupBy helpers. extractGroupBy
  returns an array (multiple groupBy queries may coexist when bucketing
  by several dimensions at once), unlike extractGroupByInterval which is
  single-instance.
- Unit tests cover the factory, isMethod, helpers, parse round-trip and
  the parsed-Query (base class) extraction path.
Extends findAggregatedFromTable so callers can bucket aggregated rows by
arbitrary dimension columns (service, path, status, ...) alongside the
existing time bucket.

- parseQueries() collects groupBy attributes into parsed[groupBy] and
  validates each against the type-specific column set
  (Metric::EVENT_COLUMNS / Metric::GAUGE_COLUMNS) via the new
  validateGroupByAttribute() helper. Unknown columns raise a
  descriptive Exception listing what is allowed.
- findFromTable() rejects groupBy without groupByInterval - keeps the
  contract simple (caller always supplies a time bucket too) and matches
  the Database adapter behaviour.
- findAggregatedFromTable() appends the escaped dim columns to both
  SELECT and GROUP BY. parseAggregatedResults() already passes unknown
  keys through to the Metric so the dim values surface as plain getters
  (getService, getPath, ...).
- Integration tests: single-dim groupBy(service) over 1d aggregates to
  the expected per-service sums; multi-dim groupBy(service)+groupBy(path)
  over 1h produces the cartesian grouping with correct sums per pair.
… ClickHouse

The Database adapter does not push aggregation through utopia-php/database
(no groupBy on the base Query class), so groupBy queries continue to be
elided from the converted query list - callers get raw rows just as they
do today with groupByInterval. The change is in validation: rejecting
malformed groupBy use up front means the negative-case contract is
identical on both backends.

- validateGroupByQueries() runs at the top of convertQueriesToDatabase.
  Rejects groupBy attributes outside the union of EVENT_COLUMNS and
  GAUGE_COLUMNS (the Database adapter shares one collection for both
  types). Rejects groupBy without an accompanying groupByInterval.
- Negative-case integration tests live in UsageBase so both adapters
  exercise them: unknown attribute throws, groupBy-without-interval
  throws.
@lohanidamodar
Copy link
Copy Markdown
Contributor Author

groupBy support for aggregated finds

Adds UsageQuery::groupBy(attribute) so cloud can express dimensional aggregation like "group by service over 1d" or "group by path x status over 1h" without forcing an SDK regeneration through utopia-php/query.

What landed

  • UsageQuery – new TYPE_GROUP_BY + groupBy() factory; isMethod() accepts it; helpers isGroupBy, extractGroupBy (returns array — multi-dim is supported), removeGroupBy. Parse round-trips via the existing JSON wire form.
  • ClickHouse adapterparseQueries() collects groupBy queries and validates each attribute against Metric::EVENT_COLUMNS / GAUGE_COLUMNS. The aggregated find branch extends SELECT and GROUP BY with the dim columns, so each Metric in the result carries the dimension values via the existing getters (getService(), getPath(), ...). groupBy without groupByInterval is rejected up front.
  • Database adapter – same validation contract (unknown column throws, missing groupByInterval throws). Aggregation itself is not pushed through utopia-php/database, so callers continue to receive raw rows for the actual data path — matching the existing groupByInterval behaviour on this adapter.

Tests

  • 11 new UsageQueryTest cases for factory, isMethod, helpers, parse round-trip.
  • 2 ClickHouse integration tests: single-dim groupBy('service') over 1d aggregates to the expected per-service sums; multi-dim groupBy('service') + groupBy('path') over 1h produces the cartesian grouping with the right per-pair sums.
  • 2 negative cases shared via UsageBase (so both adapters exercise them): unknown column throws, groupBy without groupByInterval throws.

Verification

./vendor/bin/pint --test                              -> pass
./vendor/bin/phpstan analyse --level max src tests     -> No errors
./vendor/bin/phpunit                                   -> 209 / 209 passing

Notes / non-goals

  • The base utopia-php/query Query class is unchanged, by design.
  • Daily pre-aggregated MV path (getTimeSeries) is unchanged — that path has its own bucketing logic and is not exposed through UsageQuery::groupBy in this iteration.
  • orderAsc/Desc on a groupBy dim is allowed (passes through the existing orderBy plumbing); the dim is a real column at GROUP BY time so ClickHouse accepts the ORDER BY against it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant