
Rigorous, reproducible derived-parquet pipeline + AI-free tests (#273) #274

Open
rdhyee wants to merge 5 commits into isamplesorg:main from rdhyee:pipeline/rigorous-273

rdhyee commented Jun 6, 2026

Implements the rigorous, reproducible, AI-free-tested data pipeline tracked in #273.

What this delivers (all human-runnable: make test / make all)

  • scripts/build_frontend_derived.py (hardened rewrite): geometry-agnostic (WKB BLOB or DuckDB GEOMETRY); decorrelated concept resolution; first-non-root material (#265/#271); deterministic COPY ordering + tie-broken dominant_source; hard-fails on duplicate pids / concept row-ids; emits {tag}_manifest.json (input+output sha256, argv, git SHA, DuckDB+extension versions).
  • scripts/validate_frontend_derived.py — a semantic trust gate, not a spot check: asserts the derived-file algebra (facet_summaries == GROUP BY facets, facet_cross_filter == conditional GROUP BY, facets.pid == map_lite.pid, pid uniqueness, H3 sums, scheme), and with --wide re-derives from the source wide and diffs the written facets/map_lite/h3 (cells, counts, dominant_source, resolution, tolerant centers). Non-zero exit on any failure.
  • tests/test_frontend_derived.py — 12 fixture tests (no network/big data): both geometry encodings, material/concept/place_name cases, dup-pid hard-fail, manifest, wide_h3 correctness, and two tests that prove the gate catches corruption internal checks miss.
  • Makefile, scripts/requirements.txt (duckdb==1.4.4), .github/workflows/pipeline-tests.yml (CI fixture gate).
  • DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality.
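To illustrate what "asserts the derived-file algebra" means in the validator bullet above, here is a minimal sketch of a single check, facet_summaries == GROUP BY facets. It uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without dependencies. The table layouts, the source names `SRC_A`/`SRC_B`, and the function name are assumptions for illustration, not the schema or code in scripts/validate_frontend_derived.py.

```python
import sqlite3


def summaries_match_facets(con: sqlite3.Connection) -> bool:
    """Return True when facet_summaries equals a fresh GROUP BY over facets."""
    recomputed = (
        "SELECT 'source' AS facet_type, source AS facet_value, COUNT(*) AS n "
        "FROM facets GROUP BY source"
    )
    stored = "SELECT facet_type, facet_value, n FROM facet_summaries"
    # Symmetric difference: any row present on one side only is a failure.
    # EXCEPT is set-based, so duplicate rows need a separate uniqueness check.
    diff = con.execute(
        f"SELECT * FROM ({recomputed} EXCEPT {stored}) "
        f"UNION ALL "
        f"SELECT * FROM ({stored} EXCEPT {recomputed})"
    ).fetchall()
    return not diff


con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical miniature versions of the derived files.
    CREATE TABLE facets (pid TEXT PRIMARY KEY, source TEXT);
    INSERT INTO facets VALUES ('p1', 'SRC_A'), ('p2', 'SRC_A'), ('p3', 'SRC_B');
    CREATE TABLE facet_summaries (facet_type TEXT, facet_value TEXT, n INTEGER);
    INSERT INTO facet_summaries VALUES ('source', 'SRC_A', 2), ('source', 'SRC_B', 1);
    """
)

ok_before = summaries_match_facets(con)

# Corrupt one stored count; the recomputed GROUP BY no longer matches.
con.execute("UPDATE facet_summaries SET n = 5 WHERE facet_value = 'SRC_B'")
ok_after = summaries_match_facets(con)
```

A check of this shape only proves the files agree with each other. It cannot tell whether facets itself is right, which is the gap the --wide re-derivation is meant to cover.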

Proven by execution (not by review)

  • Builder: 5.4 s on the real 20M-row 202604 wide (the prior approach ran for more than 16 minutes and was killed).
  • --wide semantic gate: exits 0 on the real 202604 rebuild; exits 1 when data is corrupted (a test zeroes coordinates — passes internal checks, fails the gate).
  • Two-run determinism: facets / map_lite / summaries / cross_filter are byte-identical across runs.
  • The validator run against live prod correctly fails (346,768 root rows) — the bug is real and the gate detects it.
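The two-run determinism claim above can be checked mechanically by hashing each output from two independent runs. The following is a rough sketch under assumed names: the directory layout, file names, and helper functions are hypothetical, and the demo writes placeholder bytes rather than real parquet.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large outputs do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_outputs(run_a: Path, run_b: Path, names: list[str]) -> list[str]:
    """Return the names whose bytes differ between the two run directories."""
    return [n for n in names if sha256_file(run_a / n) != sha256_file(run_b / n)]


with tempfile.TemporaryDirectory() as tmp:
    run_a, run_b = Path(tmp, "run_a"), Path(tmp, "run_b")
    run_a.mkdir()
    run_b.mkdir()
    # Placeholder bytes standing in for a deterministic output.
    for run in (run_a, run_b):
        (run / "facets.parquet").write_bytes(b"identical bytes")
    # Placeholder for an output that carries float jitter between runs.
    (run_a / "h3.parquet").write_bytes(b"center=1.0000001")
    (run_b / "h3.parquet").write_bytes(b"center=1.0000002")

    changed = differing_outputs(run_a, run_b, ["facets.parquet", "h3.parquet"])
```

Here `changed` lists only the file that differs, which mirrors the distinction the PR draws between the byte-identical discrete files and the float H3 centroids.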

Process

Scope + correctness hardened by three adversarial Codex rounds — Codex literally proved an earlier validator passed a wrecked rebuild; that hole (and the follow-ups: h3 resolution/centers, scheme, wide_h3) is closed and each fix has a test. AI sign-off was not the gate — the executable tests are.

Threat-model note: the --wide gate imports the same build_base_tables, so it catches corruption / staleness / wrong-version of published artifacts; builder-logic correctness is covered separately by the fixture tests (which assert against hand-written expected values).
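The threat-model note can be made concrete with a small sketch of "re-derive and diff". As above, sqlite3 stands in for DuckDB, and `derive_map_lite` is a hypothetical placeholder for the shared derivation (the PR names the real one as build_base_tables); the columns and values are invented for the example.

```python
import sqlite3


def derive_map_lite(con: sqlite3.Connection, target: str) -> None:
    """Hypothetical shared derivation: project map rows out of the wide table."""
    con.execute(f"DROP TABLE IF EXISTS {target}")
    con.execute(f"CREATE TABLE {target} AS SELECT pid, lat, lng FROM wide")


def internal_check(con: sqlite3.Connection) -> bool:
    """Internal consistency only: pids in map_lite are unique."""
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT pid) FROM map_lite"
    ).fetchone()
    return total == distinct


def wide_gate(con: sqlite3.Connection) -> bool:
    """Re-derive from the source and require zero rows in the symmetric diff."""
    derive_map_lite(con, "rederived")
    diff = con.execute(
        "SELECT * FROM (SELECT * FROM map_lite EXCEPT SELECT * FROM rederived) "
        "UNION ALL "
        "SELECT * FROM (SELECT * FROM rederived EXCEPT SELECT * FROM map_lite)"
    ).fetchall()
    return not diff


con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE wide (pid TEXT, lat REAL, lng REAL);
    INSERT INTO wide VALUES ('p1', 10.5, 20.25), ('p2', -33.0, 151.0);
    """
)
derive_map_lite(con, "map_lite")  # the "published" artifact
gate_clean = wide_gate(con)

# Zero the coordinates in the published artifact, as the PR's test does.
con.execute("UPDATE map_lite SET lat = 0, lng = 0")
internal_after = internal_check(con)
gate_after = wide_gate(con)
```

The corrupted table still passes the internal check but fails the re-derivation diff. Note that because both sides call the same derivation, a bug inside `derive_map_lite` would be reproduced on both sides and go unnoticed, which is exactly the boundary the note describes.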

Notes

Closes #273 (pending the publish step, tracked there).


— 🤖 rbotyee (RY directing, out-of-office autonomous build). Codex adversarial ×3.

rdhyee and others added 4 commits June 3, 2026 07:25
… derived parquet

(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.

(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.

Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sts (isamplesorg#273)

Rebuilds the Stage-4 derived-parquet pipeline as a real, tested, human-runnable
system (no AI in the loop to trust). Closes defects found by EXECUTION that
document/AI review missed.

build_frontend_derived.py (rewrite):
- geometry-agnostic (WKB BLOB *or* DuckDB GEOMETRY) — fixes the silent
  BinderException on 202601/Zenodo wides
- decorrelated concept resolution (unnest+arg_min + joins) — fixes the
  MAP-cross-join perf blowup (>16 min -> 5.4 s on the 20M-row wide)
- material = first NON-ROOT concept (isamplesorg#265/isamplesorg#271); deterministic COPY ORDER BY +
  tie-broken dominant_source + rounded centroids
- strict CLI (unknown --only/--skip fails; --tag required)
- emits {tag}_manifest.json: input/output sha256, argv, git SHA, DuckDB +
  extension versions (machine-checkable build identity)

validate_frontend_derived.py (new, algebraic gate):
- asserts the derived-file ALGEBRA, not spot checks: summaries == GROUP BY
  facets; cross_filter == conditional GROUP BY; facets.pid == map_lite.pid;
  pid uniqueness; H3 counts sum to map_lite; schema. Non-zero exit on failure.

tests/test_frontend_derived.py (new): fixture unit tests over tiny synthetic
wides (BLOB + GEOMETRY), material/concept/place_name/CLI cases. 6 tests.

Makefile (wide/derived/validate/test/all), scripts/requirements.txt (duckdb
pinned), .github/workflows/pipeline-tests.yml (CI fixture gate).

DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality: Stage-4 now
scripted; geometry contract; non-reproducibility of deployed 202601 facets
(346,768 vs 528,983); version skew; h3 UBIGINT; cross_filter shape;
first-non-root vs leaf.

Scope hardened by adversarial Codex audit (epic isamplesorg#273). Supersedes isamplesorg#271.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…terminism (Codex round 2)

Codex PROVED the validator passed a wrecked rebuild (corrupted material/coords/H3
with self-consistent summaries -> exit 0). Fixes:
- validate_frontend_derived.py --wide: re-derive from the source wide and
  EXCEPT-diff the written facets/map_lite/h3 — catches corruption/stale/
  wrong-version that internal consistency cannot. Proven by a new test that
  corrupts coords (passes internal checks, FAILS the --wide gate). Passes on the
  real 202604 rebuild.
- builder HARD-fails on duplicate pids / duplicate concept row_ids (was a warning)
- --threads option; determinism claim made honest: facets/map_lite/summaries/
  cross_filter are byte-identical run-to-run (verified); float h3 centroids are
  display-only (compared on discrete cols only).
- tests: semantic-gate-catches-corruption, dup-pid-hard-fail, manifest, wide_h3 (10 total)
- docs: SERIALIZATIONS deployed-file caveat (202601 still has root rows) vs
  builder contract; DATA_PROVENANCE wide_h3 coverage precise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wide_h3 correctness (Codex round 3)

- validator --wide now also diffs h3 resolution (exact) + center_lat/lng
  (tolerant 1e-4: catches gross corruption, ignores float/thread last-ULP jitter)
- facet_summaries.scheme contract checked (must be NULL)
- wide_h3 cell correctness test (cross-checked vs map_lite)
- tests prove h3 center/resolution corruption + scheme corruption are caught (12 total)

Verified: 12 passed; real --wide gate exits 0 on the 202604 rebuild with the new
checks; h3 center delta 1e-6 (well within 1e-4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ance + verify manifest (Codex/workflow round 4)

A proof workflow (independent re-exec + adversarial attack) found two real misses:
1. H3 centroids: shifting every cell center ~9m (8e-5 deg) passed the loose 1e-4
   tolerance. Tightened to 1e-5 (~1m); residual undetected error now bounded at
   ~1m on display-only centroids. Re-running the exact attack now FAILS the gate.
2. manifest.json was never validated — corrupting its sha256 attestations passed.
   Validator now verifies every output file's sha256 (and the input's, with
   --wide) against the manifest. (Self-attesting, not signed — documented.)

Both attacks re-run against the fixed gate now exit 1. Clean real rebuild still
exits 0. 14 fixture tests (added regressions for both misses).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rdhyee commented Jun 6, 2026

Adversarial proof round (workflow: independent re-exec + attack + verdict)

Ran a workflow that independently re-executes the pipeline and attacks it. It found two real misses, both now fixed + regression-tested (re-running the exact attacks against the fixed gate now exits 1; the clean rebuild still exits 0):

  1. H3 centroids within tolerance — shifting every cell center ~9m (8e-5°) passed the loose 1e-4 tolerance. → tightened to 1e-5 (~1m); residual undetected error bounded at ~1m on display-only centroids.
  2. manifest.json never validated — corrupting its sha256 attestations passed. → validator now verifies every output file's sha256 (and the input's, with --wide) against the manifest.
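The arithmetic behind the first miss is easy to reproduce. The sketch below is illustrative only: the coordinates and function names are made up, and the metres-per-degree constant is the usual approximation for latitude.

```python
METERS_PER_DEGREE_LAT = 111_320  # approximate length of one degree of latitude


def max_center_delta(written, rederived):
    """Largest absolute lat or lng difference between paired cell centers."""
    return max(
        max(abs(w_lat - r_lat), abs(w_lng - r_lng))
        for (w_lat, w_lng), (r_lat, r_lng) in zip(written, rederived)
    )


def centers_ok(written, rederived, tol):
    """True when every center is within tol degrees of its re-derived value."""
    return max_center_delta(written, rederived) <= tol


rederived = [(37.0, -122.0), (10.5, 20.25)]
attacked = [(lat + 8e-5, lng) for lat, lng in rederived]  # the ~9 m shift
jittered = [(lat + 1e-6, lng) for lat, lng in rederived]  # float-level noise

passes_loose = centers_ok(attacked, rederived, 1e-4)  # old tolerance
passes_tight = centers_ok(attacked, rederived, 1e-5)  # tightened tolerance
jitter_ok = centers_ok(jittered, rederived, 1e-5)
shift_m = 8e-5 * METERS_PER_DEGREE_LAT
```

With these inputs the shifted centers slip under 1e-4 but not under 1e-5, while noise of about 1e-6 still passes the tighter bound.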

Honest verdict (NOT "foolproof")

Proven (by execution, on the real 292MB wide): no root-material leak, sentinel preserved, pid uniqueness + set-equality, summaries == GROUP BY facets, cross_filter == conditional GROUP BY, H3 sums + discrete cells + centroids, schema, manifest integrity, byte-identical re-runs (discrete files).

Documented boundaries (not bugs — scope):

  • The --wide gate imports the same build_base_tables, so it catches corruption/staleness/wrong-version but not builder-logic bugs — those are covered only by the fixture tests' hand-written expectations.
  • The manifest is self-attesting, not signed — catches corruption that didn't also rewrite the manifest, not a consistent file+manifest rewrite.
  • The deployed 202601 files are not reproducible (unrecorded pipeline).
  • The R2 publish + current/manifest.json cutover is human-gated (not automated).
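The "self-attesting, not signed" boundary is easiest to see in code. This is a minimal sketch assuming a manifest shaped like `{"outputs": {filename: sha256}}`; that layout, the file names, and both helper functions are hypothetical, not the format emitted by the builder.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_manifest(manifest_path: Path, names: list[str]) -> None:
    """Record the current sha256 of each named file next to the manifest."""
    base = manifest_path.parent
    outputs = {n: hashlib.sha256((base / n).read_bytes()).hexdigest() for n in names}
    manifest_path.write_text(json.dumps({"outputs": outputs}, indent=2))


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the names of files whose bytes no longer match the manifest."""
    base = manifest_path.parent
    outputs = json.loads(manifest_path.read_text())["outputs"]
    return [
        name
        for name, expected in outputs.items()
        if hashlib.sha256((base / name).read_bytes()).hexdigest() != expected
    ]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    output = base / "facets.parquet"
    output.write_bytes(b"original bytes")  # placeholder, not real parquet
    manifest_path = base / "demo_manifest.json"
    write_manifest(manifest_path, ["facets.parquet"])

    clean = verify_manifest(manifest_path)

    # Corrupt the file only: the stored hash no longer matches.
    output.write_bytes(b"tampered bytes")
    tampered = verify_manifest(manifest_path)

    # Rewrite the manifest to match the tampered file: verification passes again.
    write_manifest(manifest_path, ["facets.parquet"])
    consistent_rewrite = verify_manifest(manifest_path)
```

The last step shows why an unsigned manifest cannot detect a consistent file+manifest rewrite: anyone who can change the file can also regenerate the hashes.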

Use it with confidence for gross-error + build-vs-deploy-divergence detection; it is not a complete tamper/provenance-attestation system without signing + automated publish.


Development

Successfully merging this pull request may close these issues.

Rigorous, reproducible data pipeline + AI-free runnable tests for derived parquet