Rigorous, reproducible data pipeline + AI-free runnable tests for derived parquet #273

@rdhyee

Epic / tracking issue. Goal: a reproducible data pipeline for the explorer's derived parquet, plus tests a human can run and trust without any AI in the loop. Scope hardened by an adversarial Codex audit (findings below). AI sign-off is explicitly not the gate — the executable test suite + human review are.

Why now

Building/verifying #271 (drop the SKOS root "Material" leak) surfaced defects that document review and AI review did NOT catch — only execution did. The Stage-4 "frontend derived" pipeline (DATA_PROVENANCE.md) is ad-hoc, unpinned, and untested.

Evidence (executable, AI-free)

scripts/validate_frontend_derived.py run against live production today → exit code 1:

facets: no root 'Material' value   FAIL  346,768 rows still = root (want 0)
summaries: no root material row     FAIL  1
cross_filter: no root material      FAIL  23
ark:/28722/k2p55x96j preserved      PASS
facets: non-empty                   PASS  5,980,282 rows
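As a rough illustration of how a validator of this shape turns individual checks into PASS/FAIL lines and a non-zero exit code, here is a minimal sketch. It is not the contents of scripts/validate_frontend_derived.py: the `material` column name, the `ROOT_MATERIAL` stand-in value, and every function name are assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple

# Hypothetical stand-in for whatever value represents the SKOS root "Material".
ROOT_MATERIAL = "Material"


class CheckResult(NamedTuple):
    name: str
    passed: bool
    detail: str


def check_no_root_material(rows: Iterable[dict]) -> CheckResult:
    """Fail if any facet row still carries the root material value."""
    n_root = sum(1 for row in rows if row.get("material") == ROOT_MATERIAL)
    return CheckResult(
        "facets: no root 'Material' value",
        n_root == 0,
        f"{n_root:,} rows still = root (want 0)",
    )


def check_non_empty(rows: Iterable[dict]) -> CheckResult:
    """Fail if the facet table has no rows at all."""
    n_rows = sum(1 for _ in rows)
    return CheckResult("facets: non-empty", n_rows > 0, f"{n_rows:,} rows")


def exit_code(results: List[CheckResult]) -> int:
    """Return 0 only when every check passed, 1 otherwise."""
    return 0 if all(r.passed for r in results) else 1


def report(results: List[CheckResult]) -> str:
    """Render one aligned line per check."""
    return "\n".join(
        f"{r.name:<36}{'PASS' if r.passed else 'FAIL'}  {r.detail}"
        for r in results
    )


# Tiny in-memory fixture: one leaked root row, one real material.
facets = [
    {"pid": "a", "material": ROOT_MATERIAL},
    {"pid": "b", "material": "rock"},
]
results = [check_no_root_material(facets), check_non_empty(facets)]
```

A command-line wrapper would print `report(results)` and pass `exit_code(results)` to `sys.exit`, so that a shell or CI job sees the failure without anyone reading the output.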

Defects to fix/track

Reproducibility & provenance

  • Deployed 202601 facets are not reproducible: a rebuild from any available wide parquet gives 528,983 root-material rows vs. the deployed 346,768; the exact production invocation was never recorded.
  • No machine-checkable build manifest (input checksum, argv, git SHA, DuckDB + extension versions, output schemas/row-counts/checksums).
  • Version skew: derived files are at 202601 while the canonical wide is at 202604; the default --tag is stale.
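One possible shape for the missing build manifest is sketched below, using only the standard library. The field names and the `build_manifest` / `manifest_json` helpers are assumptions for illustration, not an existing format in this repository. The git SHA and the DuckDB and extension versions are passed in by the caller (for example from `git rev-parse HEAD`), and output schemas are left out to keep the sketch short.

```python
import hashlib
import json
import platform
from pathlib import Path
from typing import Dict, Iterable, Optional


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large parquet files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path: str, rows: Optional[int] = None) -> dict:
    """Checksum and size for one file, plus a row count if the caller has one."""
    entry = {"sha256": sha256_file(path), "bytes": Path(path).stat().st_size}
    if rows is not None:
        entry["rows"] = rows
    return entry


def build_manifest(
    inputs: Iterable[str],
    outputs: Iterable[str],
    argv: Iterable[str],
    git_sha: str,
    tool_versions: Dict[str, str],
    row_counts: Dict[str, int],
) -> dict:
    """Collect everything needed to re-run and verify a build (hypothetical layout)."""
    return {
        "argv": list(argv),
        "git_sha": git_sha,
        "python": platform.python_version(),
        "tool_versions": dict(tool_versions),
        "inputs": {Path(p).name: describe_file(p) for p in inputs},
        "outputs": {
            Path(p).name: describe_file(p, row_counts.get(Path(p).name))
            for p in outputs
        },
    }


def manifest_json(manifest: dict) -> str:
    """Serialize with sorted keys so identical builds give identical bytes."""
    return json.dumps(manifest, sort_keys=True, indent=2)
```

Writing `manifest_json(...)` next to the outputs would let a later run compare checksums and row counts mechanically instead of relying on a remembered invocation.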

Builder correctness (scripts/build_frontend_derived.py)

Schema / contract drift

Validator is too weak (scripts/validate_frontend_derived.py)

Docs

Acceptance — tests assert the derived-file algebra, not spot checks

A wrong rebuild must FAIL. Required (all human-runnable, make test / pytest, non-zero exit on failure):

  • facet_summaries == GROUP BY sample_facets_v2; facet_cross_filter == conditional GROUP BY sample_facets_v2; sample_facets_v2.pid == samples_map_lite.pid
  • PID uniqueness on every pid-keyed file
  • Exact schema tests (types, column order, nullability, value ranges)
  • H3: int/hex equivalence, resolution correctness, summary counts sum to geo sample count, deterministic tie policy
  • Geometry fixtures: WKB BLOB, GEOMETRY, null, invalid, non-point, out-of-range
  • Concept-resolution fixtures (material/context/object_type; missing row-id; root policy) — root-first / root-only / real-first / NULL array / missing-id
  • place_name fixtures (null/empty/single/multi/quotes/serialization)
  • CLI failure tests (bad --only fails; missing output fails; --validate-against exits non-zero on mismatch)
  • Perf gate: fixture unit tests in CI + marked full/sampled smoke with a wall-clock budget
  • Publish/cutover: no mixed 202601/202604 artifacts unless documented; current/manifest.json coherent; remote HEAD/checksum checks
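To make the first two requirements concrete, here is a minimal pure-Python sketch of an algebraic check over tiny in-memory fixtures: it recomputes the summaries from the facet rows and compares them with the published summaries, then checks PID uniqueness and PID-set equality. The summary columns (`facet_type`, `facet_value`, `count`) and all helper names are assumptions; the real files would be read from parquet, and the real schemas may differ.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Facet dimensions named in the issue; treating them as column names is an assumption.
FACET_COLUMNS = ("material", "context", "object_type")


def expected_summaries(facet_rows: Iterable[dict]) -> Counter:
    """GROUP BY (facet column, value) over the per-sample facet rows."""
    counts: Counter = Counter()
    for row in facet_rows:
        for column in FACET_COLUMNS:
            value = row.get(column)
            if value is not None:
                counts[(column, value)] += 1
    return counts


def summary_mismatches(
    facet_rows: Iterable[dict], summary_rows: Iterable[dict]
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Return {key: (expected, actual)} for every count that disagrees."""
    expected = expected_summaries(facet_rows)
    actual: Counter = Counter()
    for row in summary_rows:
        actual[(row["facet_type"], row["facet_value"])] += row["count"]
    return {
        key: (expected.get(key, 0), actual.get(key, 0))
        for key in set(expected) | set(actual)
        if expected.get(key, 0) != actual.get(key, 0)
    }


def duplicate_pids(rows: Iterable[dict]) -> List[str]:
    """PIDs that appear more than once in a pid-keyed table."""
    counts = Counter(row["pid"] for row in rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


def pid_set_difference(
    facet_rows: Iterable[dict], map_rows: Iterable[dict]
) -> Tuple[List[str], List[str]]:
    """PIDs only in the facet table, and PIDs only in the map table."""
    facet_pids = {row["pid"] for row in facet_rows}
    map_pids = {row["pid"] for row in map_rows}
    return sorted(facet_pids - map_pids), sorted(map_pids - facet_pids)


# Fixture rows; the values are invented for the example.
facets = [
    {"pid": "p1", "material": "rock", "context": "marine", "object_type": "core"},
    {"pid": "p2", "material": "rock", "context": None, "object_type": "core"},
]
good_summaries = [
    {"facet_type": "material", "facet_value": "rock", "count": 2},
    {"facet_type": "context", "facet_value": "marine", "count": 1},
    {"facet_type": "object_type", "facet_value": "core", "count": 2},
]
map_lite = [{"pid": "p1"}, {"pid": "p2"}]
```

In a pytest suite each helper would back one test that asserts an empty result, so a wrong rebuild fails with the exact disagreeing keys or PIDs in the assertion message.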

Workstreams (Codex-recommended split — file as sub-issues if/when picked up)

  1. Provenance manifest + canonical input fetch/verify
  2. Builder hardening (geometry, concept resolution, determinism, CLI failures)
  3. Data-contract pytest suite over fixtures
  4. Full-output validator over real parquet (algebraic consistency)
  5. Dependency / container pinning (DuckDB, spatial, H3)
  6. Docs reconciliation (DATA_PROVENANCE.md, SERIALIZATIONS.md)
  7. CI + optional publish/cutover automation

Related

#271 (material fix — carries the perf regression above) · #272 (OC sidecar) · #268/#264 (provenance docs) · #265/#260 (reports that surfaced this) · #131 #135 #138


Scope hardened by an adversarial Codex audit (20 defects + algebra-not-spot-checks). 🤖 rbotyee; RY directing. Process: Codex attacks, executable tests gate, human approves.
