Rigorous, reproducible data pipeline + AI-free runnable tests for derived parquet #273

**Epic / tracking issue.** Goal: a **reproducible data pipeline** for the explorer's derived parquet, plus **tests a human can run and trust without any AI in the loop**. Scope hardened by an adversarial Codex audit (findings below). AI sign-off is explicitly **not** the gate — the executable test suite + human review are.

## Why now
Building/verifying #271 (drop the SKOS root "Material" leak) surfaced defects that **document review and AI review did NOT catch — only execution did.** The Stage-4 "frontend derived" pipeline (`DATA_PROVENANCE.md`) is ad-hoc, unpinned, and untested.

## Evidence (executable, AI-free)
`scripts/validate_frontend_derived.py` run against **live production** today → exit code **1**:
```
facets: no root 'Material' value   FAIL  346,768 rows still = root (want 0)
summaries: no root material row     FAIL  1
cross_filter: no root material      FAIL  23
ark:/28722/k2p55x96j preserved      PASS
facets: non-empty                   PASS  5,980,282 rows
```

## Defects to fix/track

**Reproducibility & provenance**
- [ ] Deployed `202601` facets are **not reproducible**: a rebuild from any available wide file gives **528,983** root-material rows vs **346,768** in the deployed file; the exact prod invocation was never recorded.
- [ ] No machine-checkable **build manifest** (input checksum, argv, git SHA, DuckDB + extension versions, output schemas/row-counts/checksums).
- [ ] **Version skew**: derived files `202601` vs canonical wide `202604`; stale default `--tag`.
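
A minimal sketch of what a machine-checkable build manifest could record, using only the standard library. The function names (`sha256_of`, `git_sha`, `build_manifest`) and the manifest layout are hypothetical, not the project's existing format. The DuckDB and extension versions are passed in by the caller rather than queried here, and output schemas and row counts are left out because they would need a parquet reader.

```python
import hashlib
import json
import platform
import subprocess
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large parquet files are not read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_sha() -> Optional[str]:
    """Return the current commit, or None when git or a checkout is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def describe(path: Path) -> dict:
    """Checksum and size for one input or output file."""
    return {"sha256": sha256_of(path), "bytes": path.stat().st_size}


def build_manifest(inputs, outputs, argv, tool_versions) -> dict:
    """Collect everything needed to re-run and verify a build (hypothetical layout)."""
    return {
        "argv": list(argv),
        "git_sha": git_sha(),
        "python": platform.python_version(),
        "tool_versions": dict(tool_versions),  # e.g. DuckDB + extensions, caller-supplied
        "inputs": {str(p): describe(Path(p)) for p in inputs},
        "outputs": {str(p): describe(Path(p)) for p in outputs},
    }
```

Writing the result with `json.dumps(manifest, indent=2, sort_keys=True)` keeps the file diffable between builds, so an unrecorded invocation or a changed input shows up as a manifest diff.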

**Builder correctness** (`scripts/build_frontend_derived.py`)
- [ ] **Geometry input-contract**: requires WKB BLOB (`ST_GeomFromWKB`); `202601`/Zenodo wides are `GEOMETRY` → `BinderException`. Make geometry-agnostic.
- [ ] **Perf regression (#271)**: `MAP` cross-join mixed with correlated subqueries → planner blowup (base build >16 min, killed). [Codex #2] `context`/`object_type` still use per-row correlated lookups. Resolve all 3 concept columns by one consistent, decorrelated method.
- [ ] [Codex #1] `--only`/`--skip` not isolated — always builds `samp`/`samp_geo` + H3/spatial before honoring them.
- [ ] [Codex #3] Unknown `--only`/`--skip` names silently succeed (typo → emits nothing, exit 0).
- [ ] [Codex #5] Concept resolution silently drops broken refs (missing `IdentifiedConcept` row-ids → NULL, no integrity threshold).
- [ ] [Codex #13] **Not byte-reproducible**: no `ORDER BY` on COPY; `MODE(source)` nondeterministic on ties; float `AVG` parallel-variance.
- [ ] [Codex #9] PID uniqueness assumed by browser, not enforced by build.
- [ ] [Codex #11] `facet_cross_filter` emits self-dimension rows the UI ignores (counts should exclude the active dimension).
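
One way the unknown-name defect (Codex #3) could be closed is to validate stage names at argument-parsing time, so a typo exits non-zero before any work starts. This is a sketch under assumptions: the stage names in `KNOWN_STAGES` are placeholders, and the comma-separated syntax for `--only`/`--skip` is assumed rather than taken from the real script.

```python
import argparse

# Hypothetical stage names for illustration; the real builder defines its own.
KNOWN_STAGES = ("facets", "summaries", "cross_filter")


def parse_stage_list(raw):
    """argparse `type=` callback: split a comma-separated list, reject unknown names."""
    names = [n.strip() for n in raw.split(",") if n.strip()]
    unknown = sorted(set(names) - set(KNOWN_STAGES))
    if unknown:
        # argparse turns this into a usage error and exits with status 2.
        raise argparse.ArgumentTypeError(
            "unknown stage(s): {}; known: {}".format(
                ", ".join(unknown), ", ".join(KNOWN_STAGES)
            )
        )
    return names


def make_parser():
    parser = argparse.ArgumentParser(prog="build_frontend_derived")
    parser.add_argument("--only", type=parse_stage_list, default=None)
    parser.add_argument("--skip", type=parse_stage_list, default=None)
    return parser


def selected_stages(args):
    """Resolve the stages to run, in canonical order, before any stage executes."""
    chosen = KNOWN_STAGES if args.only is None else args.only
    skipped = set(args.skip or ())
    return [s for s in KNOWN_STAGES if s in chosen and s not in skipped]
```

Computing `selected_stages` up front also gives the builder a single list to consult before doing shared work, which is the precondition for fixing the isolation problem in Codex #1.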

**Schema / contract drift**
- [ ] [Codex #8] `place_name` unstable: `VARCHAR[]` → cast to VARCHAR in facets but stays array in `samples_map_lite`; docs say VARCHAR.
- [ ] [Codex #10] `facet_cross_filter` baseline rows have all `filter_*` NULL, contradicting `SERIALIZATIONS.md` ("exactly one non-null").
- [ ] [Codex #14] H3 summary types: `COUNT(*)`/`UBIGINT` vs docs' `INT`/`BIGINT`.
- [ ] [Codex #6] Concept-selection contract contradiction: script = "first-remaining" vs `SERIALIZATIONS.md` = "leaf concept".

**Validator is too weak** (`scripts/validate_frontend_derived.py`)
- [ ] [Codex #17] Can pass very wrong data: a file with no root value, the 1 sentinel PID, >1M rows, and >50% populated passes even if materials collapsed or counts are wrong.
- [ ] [Codex #12] Never recomputes `facet_summaries`/`facet_cross_filter` from `sample_facets_v2` → drift uncaught.
- [ ] [Codex #18] Sentinel PID check not uniqueness-safe (`fetchone()` can mask a wrong duplicate).
- [ ] [Codex #7] Only checks material root; context/object_type root leakage untested.
- [ ] [Codex #16] Narrow scope — ignores `samples_map_lite`, H3 summaries, vocab_labels, manifest, `current` aliases.
- [ ] [Codex #4] `--validate-against` is print-only, not a gate (skips missing files, compares only column names, never exits non-zero).
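
To illustrate the sentinel issue (Codex #18), here is a sketch of a uniqueness-safe check that fetches every matching row instead of the first one. The column names `pid` and `material`, the expected value, and the helper name `check_sentinel` are hypothetical. The demo uses the standard-library `sqlite3` as a stand-in for the project's DuckDB connection, on the assumption that the same `execute(...).fetchall()` pattern carries over.

```python
import sqlite3


def check_sentinel(conn, table, pid, expected_material):
    """Return (ok, detail). Fails on zero rows, duplicate rows, or a wrong value."""
    # `table` is interpolated into SQL, so it must come from trusted code.
    rows = conn.execute(
        "SELECT material FROM {} WHERE pid = ?".format(table), (pid,)
    ).fetchall()
    if len(rows) != 1:
        # fetchone() would return the first row here and hide the duplicate.
        return False, "expected exactly 1 row for {}, found {}".format(pid, len(rows))
    if rows[0][0] != expected_material:
        return False, "{}: material is {!r}, expected {!r}".format(
            pid, rows[0][0], expected_material
        )
    return True, "ok"


def demo():
    """Build a tiny table with a duplicated sentinel and run the check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facets (pid TEXT, material TEXT)")
    conn.executemany(
        "INSERT INTO facets VALUES (?, ?)",
        [
            ("ark:/28722/k2p55x96j", "rock"),      # placeholder value
            ("ark:/28722/k2p55x96j", "Material"),  # wrong duplicate
        ],
    )
    return check_sentinel(conn, "facets", "ark:/28722/k2p55x96j", "rock")
```

In this demo a `fetchone()`-based check could see only the `"rock"` row and pass, while `check_sentinel` reports the duplicate.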

**Docs**
- [ ] [Codex #19] `DATA_PROVENANCE.md` stale (says Stage 4 ad-hoc / "no build script" while shipping it).
- [ ] [Codex #15/#20] Pin deps (DuckDB + h3/spatial versions, currently community-installed at runtime); reconcile `SERIALIZATIONS.md`.

## Acceptance — tests assert the derived-file *algebra*, not spot checks
A wrong rebuild must FAIL. Required (all human-runnable, `make test` / `pytest`, non-zero exit on failure):
- [ ] `facet_summaries == GROUP BY sample_facets_v2`; `facet_cross_filter == conditional GROUP BY sample_facets_v2`; `sample_facets_v2.pid == samples_map_lite.pid`
- [ ] PID uniqueness on every pid-keyed file
- [ ] Exact schema tests (types, column order, nullability, value ranges)
- [ ] H3: int/hex equivalence, resolution correctness, summary counts sum to geo sample count, deterministic tie policy
- [ ] Geometry fixtures: WKB BLOB, GEOMETRY, null, invalid, non-point, out-of-range
- [ ] Concept-resolution fixtures (material/context/object_type; missing row-id; root policy) — root-first / root-only / real-first / NULL array / missing-id
- [ ] `place_name` fixtures (null/empty/single/multi/quotes/serialization)
- [ ] CLI failure tests (bad `--only` fails; missing output fails; `--validate-against` exits non-zero on mismatch)
- [ ] Perf gate: fixture unit tests in CI + marked full/sampled smoke with a wall-clock budget
- [ ] Publish/cutover: no mixed `202601`/`202604` artifacts unless documented; `current/manifest.json` coherent; remote HEAD/checksum checks
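
The first two criteria above can be sketched in plain Python over in-memory rows, to show the shape of an algebraic check as opposed to a spot check. This is an illustration only: the dimension list and the `facet_type`/`facet_value`/`count` column names are assumptions, and a real test would read the parquet files rather than lists of dicts.

```python
from collections import Counter

# Assumed facet dimensions; the real list lives in the builder.
DIMS = ("material", "context", "object_type")


def expected_summaries(facet_rows):
    """Recompute per-dimension value counts from the per-sample facet rows."""
    counts = Counter()
    for row in facet_rows:
        for dim in DIMS:
            value = row.get(dim)
            if value is not None:
                counts[(dim, value)] += 1
    return counts


def summary_drift(facet_rows, summary_rows):
    """Return {(dim, value): (expected, actual)} for every disagreeing count."""
    expected = expected_summaries(facet_rows)
    actual = {}
    for row in summary_rows:
        key = (row["facet_type"], row["facet_value"])
        if key in actual:
            raise ValueError("duplicate summary row for {}".format(key))
        actual[key] = row["count"]
    keys = set(expected) | set(actual)
    return {
        k: (expected.get(k, 0), actual.get(k, 0))
        for k in keys
        if expected.get(k, 0) != actual.get(k, 0)
    }


def duplicate_pids(rows):
    """List every pid that appears more than once in a pid-keyed file."""
    seen = Counter(row["pid"] for row in rows)
    return sorted(pid for pid, n in seen.items() if n > 1)
```

A test built on this asserts `summary_drift(...) == {}` and `duplicate_pids(...) == []`, so any wrong rebuild that changes a count or duplicates a sample fails, whether or not a spot check happens to look at the affected row.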

## Workstreams (Codex-recommended split — file as sub-issues if/when picked up)
1. Provenance manifest + canonical input fetch/verify
2. Builder hardening (geometry, concept resolution, determinism, CLI failures)
3. Data-contract pytest suite over fixtures
4. Full-output validator over real parquet (algebraic consistency)
5. Dependency / container pinning (DuckDB, spatial, H3)
6. Docs reconciliation (`DATA_PROVENANCE.md`, `SERIALIZATIONS.md`)
7. CI + optional publish/cutover automation

## Related
#271 (material fix — carries the perf regression above) · #272 (OC sidecar) · #268/#264 (provenance docs) · #265/#260 (reports that surfaced this) · #131 #135 #138

---
*Scope hardened by an adversarial Codex audit (20 defects + algebra-not-spot-checks). 🤖 rbotyee; RY directing. Process: Codex attacks, executable tests gate, human approves.*

