Rigorous, reproducible derived-parquet pipeline + AI-free tests (#273) by rdhyee · Pull Request #274 · isamplesorg/isamplesorg.github.io

rdhyee · 2026-06-06T01:09:55Z

Implements the rigorous, reproducible, AI-free-tested data pipeline tracked in #273.

What this delivers (all human-runnable: `make test` / `make all`)

scripts/build_frontend_derived.py (hardened rewrite): geometry-agnostic (WKB BLOB or DuckDB GEOMETRY); decorrelated concept resolution; first-non-root material (#265 "Issues with material types"; #271 "build: stop SKOS root 'Material' leaking into the material facet"); deterministic COPY ordering + tie-broken dominant_source; hard-fails on duplicate pids / concept row-ids; emits {tag}_manifest.json (input+output sha256, argv, git SHA, DuckDB+extension versions).
scripts/validate_frontend_derived.py — a semantic trust gate, not a spot check: asserts the derived-file algebra (facet_summaries == GROUP BY facets, facet_cross_filter == conditional GROUP BY, facets.pid == map_lite.pid, pid uniqueness, H3 sums, scheme), and with --wide re-derives from the source wide and diffs the written facets/map_lite/h3 (cells, counts, dominant_source, resolution, tolerant centers). Non-zero exit on any failure.
tests/test_frontend_derived.py — 12 fixture tests (no network/big data): both geometry encodings, material/concept/place_name cases, dup-pid hard-fail, manifest, wide_h3 correctness, and two tests that prove the gate catches corruption internal checks miss.
Makefile, scripts/requirements.txt (duckdb==1.4.4), .github/workflows/pipeline-tests.yml (CI fixture gate).
DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality.
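
To make the "derived-file algebra" idea concrete, here is a minimal sketch of one such assertion (facet_summaries == GROUP BY facets). It is not the validator's actual code: it uses the standard-library sqlite3 module as a stand-in for DuckDB so it runs without dependencies, and the table layout, column names (`source`, `facet_value`, `n`) and source labels are hypothetical.

```python
import sqlite3


def summaries_match_facets(con: sqlite3.Connection) -> bool:
    """True when the written summary equals GROUP BY over facets, in both directions."""
    derived = "SELECT source AS facet_value, COUNT(*) AS n FROM facets GROUP BY source"
    written = "SELECT facet_value, n FROM facet_summaries"
    # EXCEPT in both directions: rows the summary is missing, and rows it should not have.
    missing = con.execute(f"{derived} EXCEPT {written}").fetchall()
    extra = con.execute(f"{written} EXCEPT {derived}").fetchall()
    return not missing and not extra


def demo() -> tuple[bool, bool]:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE facets (pid TEXT PRIMARY KEY, source TEXT)")
    con.executemany(
        "INSERT INTO facets VALUES (?, ?)",
        [("p1", "src_a"), ("p2", "src_a"), ("p3", "src_b")],  # illustrative rows
    )
    con.execute("CREATE TABLE facet_summaries (facet_value TEXT, n INTEGER)")
    con.executemany(
        "INSERT INTO facet_summaries VALUES (?, ?)", [("src_a", 2), ("src_b", 1)]
    )
    clean = summaries_match_facets(con)
    # Corrupt one count: the algebraic check should now fail.
    con.execute("UPDATE facet_summaries SET n = 3 WHERE facet_value = 'src_a'")
    corrupted = summaries_match_facets(con)
    return clean, corrupted
```

Checking both EXCEPT directions matters: a one-directional diff would miss a summary that contains every correct row plus a spurious one.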

Proven by execution (not by review)

Builder: 5.4 s on the real 20M-row 202604 wide (the prior approach was a >16-min perf blowup, killed).
--wide semantic gate: exits 0 on the real 202604 rebuild; exits 1 when data is corrupted (a test zeroes coordinates — passes internal checks, fails the gate).
Two-run determinism: facets / map_lite / summaries / cross_filter are byte-identical across runs.
The validator run against live prod correctly fails (346,768 root rows) — the bug is real and the gate detects it.
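
The two-run determinism claim can be checked mechanically by hashing the outputs of two independent builds. The sketch below assumes each run writes into its own directory; the directory and file names are hypothetical, and the demo writes placeholder bytes rather than real parquet.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(run_a: Path, run_b: Path, names: list[str]) -> list[str]:
    """Names whose bytes differ between two build directories."""
    return [n for n in names if sha256_of(run_a / n) != sha256_of(run_b / n)]


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        run_a, run_b = Path(tmp, "run_a"), Path(tmp, "run_b")
        for run in (run_a, run_b):
            run.mkdir()
            (run / "facets.parquet").write_bytes(b"same bytes")  # placeholder content
        # Simulate last-digit float jitter in a display-only file.
        (run_a / "h3.parquet").write_bytes(b"center=1.0000001")
        (run_b / "h3.parquet").write_bytes(b"center=1.0000002")
        return differing_files(run_a, run_b, ["facets.parquet", "h3.parquet"])
```

An empty result over the discrete files is what "byte-identical across runs" means here; a file listed in the result would need a tolerant, column-level comparison instead.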

Process

Scope + correctness hardened by three adversarial Codex rounds — Codex literally proved an earlier validator passed a wrecked rebuild; that hole (and the follow-ups: h3 resolution/centers, scheme, wide_h3) is closed and each fix has a test. AI sign-off was not the gate — the executable tests are.

Threat-model note: the --wide gate imports the same build_base_tables, so it catches corruption / staleness / wrong-version of published artifacts; builder-logic correctness is covered separately by the fixture tests (which assert against hand-written expected values).

Notes

⚠️ Stacks on #264 (docs+scripts: data provenance map + build scripts for the 6 unscripted derived parquet), which introduces build_frontend_derived.py; the diff collapses once #264 merges.
Supersedes #271 (build: stop SKOS root 'Material' leaking into the material facet): its material fix is folded in, correctly and performantly.
Does NOT publish to R2. The rebuilt 202606 files + manifest are generated and validated locally; the production publish (R2 upload + current/manifest.json cutover) is left for a human gate.

Closes #273 (pending the publish step, tracked there).

— 🤖 rbotyee (RY directing, out-of-office autonomous build). Codex adversarial ×3.

… derived parquet

(a) DATA_PROVENANCE.md: end-to-end build chain (export → base PQG → sidecar merge → frontend derived → R2/Worker), per-stage script/command + the key constraint (the iSamples export is frozen: Central API offline since Aug 2025; new per-source data must come via the pid sidecar merge, not re-export). Folds the sidecar pattern (previously only in the Obsidian vault) into the repo.

(c) scripts/build_frontend_derived.py: reproduces the 6 derived files that had no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite, wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter, all from one `wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts vs published.

Validated vs the published isamples_202601 files (built from 202604 wide): EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and facet_cross_filter (+86) are schema-correct, with small deltas from the 202604-vs-202601 version gap + the original cross-filter pruning self-pairs (this build is an exhaustive superset); can be reconciled if exact parity is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sts (isamplesorg#273)

Rebuilds the Stage-4 derived-parquet pipeline as a real, tested, human-runnable system (no AI in the loop to trust). Closes defects found by EXECUTION that document/AI review missed.

build_frontend_derived.py (rewrite):
- geometry-agnostic (WKB BLOB *or* DuckDB GEOMETRY); fixes the silent BinderException on 202601/Zenodo wides
- decorrelated concept resolution (unnest+arg_min + joins); fixes the MAP-cross-join perf blowup (>16 min -> 5.4 s on the 20M-row wide)
- material = first NON-ROOT concept (isamplesorg#265/isamplesorg#271); deterministic COPY ORDER BY + tie-broken dominant_source + rounded centroids
- strict CLI (unknown --only/--skip fails; --tag required)
- emits {tag}_manifest.json: input/output sha256, argv, git SHA, DuckDB + extension versions (machine-checkable build identity)

validate_frontend_derived.py (new, algebraic gate):
- asserts the derived-file ALGEBRA, not spot checks: summaries == GROUP BY facets; cross_filter == conditional GROUP BY; facets.pid == map_lite.pid; pid uniqueness; H3 counts sum to map_lite; schema. Non-zero exit on failure.

tests/test_frontend_derived.py (new): fixture unit tests over tiny synthetic wides (BLOB + GEOMETRY), material/concept/place_name/CLI cases. 6 tests.

Makefile (wide/derived/validate/test/all), scripts/requirements.txt (duckdb pinned), .github/workflows/pipeline-tests.yml (CI fixture gate).

DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality: Stage-4 now scripted; geometry contract; non-reproducibility of deployed 202601 facets (346,768 vs 528,983); version skew; h3 UBIGINT; cross_filter shape; first-non-root vs leaf.

Scope hardened by adversarial Codex audit (epic isamplesorg#273). Supersedes isamplesorg#271.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
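
As an illustration of two rules named in this commit, "material = first NON-ROOT concept" and "tie-broken dominant_source", here is a small pure-Python sketch. The real builder does this in DuckDB SQL; the root label, the `None` fallback when only the root is present, and the alphabetical tie-break are all assumptions made for the example, not confirmed behavior.

```python
from collections import Counter
from typing import Optional

ROOT_MATERIAL = "Material"  # hypothetical label for the SKOS root concept


def first_non_root(concepts: list[str], root: str = ROOT_MATERIAL) -> Optional[str]:
    """Return the first concept that is not the root, preserving input order."""
    for concept in concepts:
        if concept != root:
            return concept
    return None  # assumption: the real builder may emit a sentinel instead


def dominant_source(sources: list[str]) -> str:
    """Most frequent source; ties resolved by name so input order cannot matter."""
    counts = Counter(sources)
    return min(counts, key=lambda s: (-counts[s], s))
```

For example, `first_non_root(["Material", "Rock", "Mineral"])` yields `"Rock"`, and `dominant_source(["b", "a", "b", "a"])` yields `"a"` regardless of row order, which is the property a deterministic build needs.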

…terminism (Codex round 2)

Codex PROVED the validator passed a wrecked rebuild (corrupted material/coords/H3 with self-consistent summaries -> exit 0). Fixes:
- validate_frontend_derived.py --wide: re-derive from the source wide and EXCEPT-diff the written facets/map_lite/h3; catches corruption/stale/wrong-version that internal consistency cannot. Proven by a new test that corrupts coords (passes internal checks, FAILS the --wide gate). Passes on the real 202604 rebuild.
- builder HARD-fails on duplicate pids / duplicate concept row_ids (was a warning)
- --threads option; determinism claim made honest: facets/map_lite/summaries/cross_filter are byte-identical run-to-run (verified); float h3 centroids are display-only (compared on discrete cols only).
- tests: semantic-gate-catches-corruption, dup-pid-hard-fail, manifest, wide_h3 (10 total)
- docs: SERIALIZATIONS deployed-file caveat (202601 still has root rows) vs builder contract; DATA_PROVENANCE wide_h3 coverage precise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
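
The gap this commit closes can be shown in a few lines: a file with zeroed coordinates is still internally consistent, and only a re-derivation from the source exposes it. The sketch below uses plain Python sets in place of DuckDB's EXCEPT; `build_rows`, `internal_checks_pass` and `wide_gate` are hypothetical stand-ins, not the functions in the repository.

```python
def build_rows(wide: list[dict]) -> set[tuple]:
    """Stand-in for the shared builder: hard-fails on duplicate pids."""
    seen, dupes = set(), set()
    for row in wide:
        (dupes if row["pid"] in seen else seen).add(row["pid"])
    if dupes:
        raise ValueError(f"duplicate pids: {sorted(dupes)}")
    return {(row["pid"], row["lat"], row["lon"]) for row in wide}


def internal_checks_pass(written: set[tuple]) -> bool:
    """Self-consistency only: pid uniqueness within the written rows."""
    pids = [row[0] for row in written]
    return len(pids) == len(set(pids))


def wide_gate(wide: list[dict], written: set[tuple]) -> list[tuple]:
    """Re-derive from the source and diff in both directions (like two EXCEPTs)."""
    return sorted(build_rows(wide) ^ written)


def demo() -> tuple[list, bool, bool]:
    wide = [
        {"pid": "p1", "lat": 10.5, "lon": 20.5},
        {"pid": "p2", "lat": -3.25, "lon": 40.0},
    ]
    clean = build_rows(wide)
    zeroed = {(pid, 0.0, 0.0) for pid, _, _ in clean}  # the corruption under test
    return wide_gate(wide, clean), internal_checks_pass(zeroed), bool(wide_gate(wide, zeroed))
```

Here `demo()` returns an empty diff for the clean rows, `True` for the internal check on the zeroed rows, and `True` for "the gate found differences", mirroring the "passes internal checks, fails the gate" test described above.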

…wide_h3 correctness (Codex round 3)

- validator --wide now also diffs h3 resolution (exact) + center_lat/lng (tolerant 1e-4: catches gross corruption, ignores float/thread last-ULP jitter)
- facet_summaries.scheme contract checked (must be NULL)
- wide_h3 cell correctness test (cross-checked vs map_lite)
- tests prove h3 center/resolution corruption + scheme corruption are caught (12 total)

Verified: 12 passed; real --wide gate exits 0 on the 202604 rebuild with the new checks; h3 center delta 1e-6 (well within 1e-4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ance + verify manifest (Codex/workflow round 4)

A proof workflow (independent re-exec + adversarial attack) found two real misses:
1. H3 centroids: shifting every cell center ~9m (8e-5 deg) passed the loose 1e-4 tolerance. Tightened to 1e-5 (~1m); residual undetected error now bounded at ~1m on display-only centroids. Re-running the exact attack now FAILS the gate.
2. manifest.json was never validated: corrupting its sha256 attestations passed. Validator now verifies every output file's sha256 (and the input's, with --wide) against the manifest. (Self-attesting, not signed; documented.)

Both attacks re-run against the fixed gate now exit 1. Clean real rebuild still exits 0. 14 fixture tests (added regressions for both misses).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rdhyee · 2026-06-06T03:16:41Z

Adversarial proof round (workflow: independent re-exec + attack + verdict)

Ran a workflow that independently re-executes the pipeline and attacks it. It found two real misses, both now fixed + regression-tested (re-running the exact attacks against the fixed gate now exits 1; the clean rebuild still exits 0):

H3 centroids within tolerance — shifting every cell center ~9m (8e-5°) passed the loose 1e-4 tolerance. → tightened to 1e-5 (~1m); residual undetected error bounded at ~1m on display-only centroids.
manifest.json never validated — corrupting its sha256 attestations passed. → validator now verifies every output file's sha256 (and the input's, with --wide) against the manifest.
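
The arithmetic behind the first miss is easy to reproduce. The sketch below assumes roughly 111.32 km per degree of latitude and a simple absolute-difference comparison; the validator's exact comparison operator is not shown in this PR text, so treat `within_tolerance` as illustrative.

```python
METERS_PER_DEGREE_LAT = 111_320.0  # approximate; longitude degrees shrink with latitude


def within_tolerance(expected: list[float], actual: list[float], tol: float) -> bool:
    """True when every coordinate differs by at most tol degrees."""
    return all(abs(e - a) <= tol for e, a in zip(expected, actual))


def degrees_to_meters(delta_deg: float) -> float:
    """Rough north-south distance for a latitude difference in degrees."""
    return delta_deg * METERS_PER_DEGREE_LAT


ATTACK_SHIFT_DEG = 8e-5  # the shift used in the attack described above
centers = [37.0, -122.0]  # illustrative center_lat / center_lng values
shifted = [value + ATTACK_SHIFT_DEG for value in centers]

passes_loose = within_tolerance(centers, shifted, 1e-4)  # old tolerance
passes_tight = within_tolerance(centers, shifted, 1e-5)  # tightened tolerance
```

With these inputs `passes_loose` is `True` and `passes_tight` is `False`, and `degrees_to_meters(8e-5)` comes to just under 9 m while `degrees_to_meters(1e-5)` is a little over 1 m, consistent with the figures quoted above.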

Honest verdict (NOT "foolproof")

Proven (by execution, on the real 292MB wide): no root-material leak, sentinel preserved, pid uniqueness + set-equality, summaries == GROUP BY facets, cross_filter == conditional GROUP BY, H3 sums + discrete cells + centroids, schema, manifest integrity, byte-identical re-runs (discrete files).

Documented boundaries (not bugs — scope):

The --wide gate imports the same build_base_tables, so it catches corruption/staleness/wrong-version but not builder-logic bugs — those are covered only by the fixture tests' hand-written expectations.
The manifest is self-attesting, not signed — catches corruption that didn't also rewrite the manifest, not a consistent file+manifest rewrite.
The deployed 202601 files are not reproducible (unrecorded pipeline).
The R2 publish + current/manifest.json cutover is human-gated (not automated).
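
The "self-attesting, not signed" boundary can be demonstrated directly: a hash check against the manifest catches a corrupted file, but not a corrupted file whose manifest entry was rewritten to match. The manifest layout below (an `outputs` mapping of file name to sha256) and the file names are hypothetical; the real {tag}_manifest.json schema is not shown in this PR text.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    # Reads the whole file; fine for a demo, stream in chunks for large parquet.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def manifest_mismatches(out_dir: Path, manifest_path: Path) -> list[str]:
    """Output names whose on-disk sha256 disagrees with the manifest."""
    manifest = json.loads(manifest_path.read_text())
    return [
        name
        for name, digest in manifest["outputs"].items()
        if sha256_of(out_dir / name) != digest
    ]


def demo() -> tuple[list[str], list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        data = out_dir / "facets.parquet"  # placeholder bytes, not real parquet
        manifest = out_dir / "demo_manifest.json"
        data.write_bytes(b"original bytes")
        manifest.write_text(json.dumps({"outputs": {data.name: sha256_of(data)}}))
        clean = manifest_mismatches(out_dir, manifest)
        data.write_bytes(b"corrupted bytes")
        corrupted = manifest_mismatches(out_dir, manifest)
        # A consistent rewrite of file + manifest is invisible to this check.
        manifest.write_text(json.dumps({"outputs": {data.name: sha256_of(data)}}))
        rewritten = manifest_mismatches(out_dir, manifest)
        return clean, corrupted, rewritten
```

The third result being empty is the documented limitation: closing it would need a signature or an independently stored digest, which is outside this PR's scope.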

Use it with confidence for gross-error + build-vs-deploy-divergence detection; it is not a complete tamper/provenance-attestation system without signing + automated publish.

rdhyee and others added 4 commits June 3, 2026 07:25

This was referenced Jun 6, 2026

Rigorous, reproducible data pipeline + AI-free runnable tests for derived parquet #273

Open

build: stop SKOS root 'Material' leaking into the material facet (#265) #271

Open

rdhyee wants to merge 5 commits into isamplesorg:main from rdhyee:pipeline/rigorous-273
