Rigorous, reproducible derived-parquet pipeline + AI-free tests (#273) #274
Open
rdhyee wants to merge 5 commits into
… derived parquet
(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.
(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.
Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sts (isamplesorg#273)

Rebuilds the Stage-4 derived-parquet pipeline as a real, tested, human-runnable
system (no AI in the loop to trust). Closes defects found by EXECUTION that
document/AI review missed.

build_frontend_derived.py (rewrite):
- geometry-agnostic (WKB BLOB *or* DuckDB GEOMETRY): fixes the silent
  BinderException on 202601/Zenodo wides
- decorrelated concept resolution (unnest+arg_min + joins): fixes the
  MAP-cross-join perf blowup (>16 min -> 5.4 s on the 20M-row wide)
- material = first NON-ROOT concept (isamplesorg#265/isamplesorg#271);
  deterministic COPY ORDER BY + tie-broken dominant_source + rounded centroids
- strict CLI (unknown --only/--skip fails; --tag required)
- emits {tag}_manifest.json: input/output sha256, argv, git SHA, DuckDB +
  extension versions (machine-checkable build identity)

validate_frontend_derived.py (new, algebraic gate):
- asserts the derived-file ALGEBRA, not spot checks: summaries == GROUP BY
  facets; cross_filter == conditional GROUP BY; facets.pid == map_lite.pid;
  pid uniqueness; H3 counts sum to map_lite; schema. Non-zero exit on failure.

tests/test_frontend_derived.py (new): fixture unit tests over tiny synthetic
wides (BLOB + GEOMETRY), material/concept/place_name/CLI cases. 6 tests.

Makefile (wide/derived/validate/test/all), scripts/requirements.txt (duckdb
pinned), .github/workflows/pipeline-tests.yml (CI fixture gate).

DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality: Stage-4 now
scripted; geometry contract; non-reproducibility of deployed 202601 facets
(346,768 vs 528,983); version skew; h3 UBIGINT; cross_filter shape;
first-non-root vs leaf.

Scope hardened by adversarial Codex audit (epic isamplesorg#273). Supersedes
isamplesorg#271.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
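The "algebra, not spot checks" idea in this commit can be sketched roughly as follows. This is only an illustration, not the actual validate_frontend_derived.py: it uses Python's built-in sqlite3 as a stand-in for DuckDB, and the table layouts, column names (`source`, `n`), and the helper names `except_count`, `check_algebra`, and `make_fixture` are all hypothetical.

```python
import sqlite3

def except_count(con, left, right):
    """Count rows produced by query `left` that are absent from query `right`."""
    sql = f"SELECT COUNT(*) FROM (SELECT * FROM ({left}) EXCEPT SELECT * FROM ({right}))"
    return con.execute(sql).fetchone()[0]

def check_algebra(con):
    """Return the names of failed invariants (an empty list means the gate passes)."""
    failures = []

    # Invariant: pid is unique in facets.
    dup = con.execute(
        "SELECT COUNT(*) FROM (SELECT pid FROM facets GROUP BY pid HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dup:
        failures.append("pid_unique")

    # Invariant: stored summaries equal a fresh GROUP BY over facets
    # (checked in both directions, i.e. the symmetric difference is empty).
    recomputed = "SELECT source, COUNT(*) AS n FROM facets GROUP BY source"
    stored = "SELECT source, n FROM facet_summaries"
    if except_count(con, recomputed, stored) or except_count(con, stored, recomputed):
        failures.append("summaries_eq_group_by")

    # Invariant: facets and map_lite cover exactly the same pids.
    f_pids, m_pids = "SELECT pid FROM facets", "SELECT pid FROM map_lite"
    if except_count(con, f_pids, m_pids) or except_count(con, m_pids, f_pids):
        failures.append("pid_set_equal")

    return failures

def make_fixture():
    """Tiny synthetic tables (hypothetical schema) in an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE facets (pid TEXT, source TEXT);
        CREATE TABLE map_lite (pid TEXT);
        CREATE TABLE facet_summaries (source TEXT, n INTEGER);
        INSERT INTO facets VALUES ('p1', 'A'), ('p2', 'A'), ('p3', 'B');
        INSERT INTO map_lite VALUES ('p1'), ('p2'), ('p3');
        INSERT INTO facet_summaries VALUES ('A', 2), ('B', 1);
    """)
    return con
```

A caller would presumably exit non-zero whenever `check_algebra` returns a non-empty list, which mirrors the "Non-zero exit on failure" behavior described above.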
…terminism (Codex round 2)

Codex PROVED the validator passed a wrecked rebuild (corrupted
material/coords/H3 with self-consistent summaries -> exit 0). Fixes:
- validate_frontend_derived.py --wide: re-derive from the source wide and
  EXCEPT-diff the written facets/map_lite/h3. Catches corruption/stale/
  wrong-version that internal consistency cannot. Proven by a new test that
  corrupts coords (passes internal checks, FAILS the --wide gate). Passes on
  the real 202604 rebuild.
- builder HARD-fails on duplicate pids / duplicate concept row_ids (was a
  warning)
- --threads option; determinism claim made honest: facets/map_lite/summaries/
  cross_filter are byte-identical run-to-run (verified); float h3 centroids
  are display-only (compared on discrete cols only).
- tests: semantic-gate-catches-corruption, dup-pid-hard-fail, manifest,
  wide_h3 (10 total)
- docs: SERIALIZATIONS deployed-file caveat (202601 still has root rows) vs
  builder contract; DATA_PROVENANCE wide_h3 coverage precise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
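To illustrate why a re-derivation gate catches what internal consistency cannot, here is a minimal sketch under stated assumptions. It again uses sqlite3 in place of DuckDB; the `wide`/`map_lite`/`facets` schemas, the `DERIVED` query, and the function names are hypothetical and much simpler than the real builder logic.

```python
import sqlite3

# Hypothetical re-derivation of map_lite from the source wide table.
DERIVED = "SELECT pid, lat, lng FROM wide WHERE lat IS NOT NULL AND lng IS NOT NULL"
# What was actually written to the derived file.
WRITTEN = "SELECT pid, lat, lng FROM map_lite"

def internal_check(con):
    """Internal consistency only: facets and map_lite cover the same pids."""
    a = con.execute("SELECT pid FROM facets EXCEPT SELECT pid FROM map_lite").fetchall()
    b = con.execute("SELECT pid FROM map_lite EXCEPT SELECT pid FROM facets").fetchall()
    return not a and not b

def wide_diff(con):
    """EXCEPT-diff in both directions; two empty lists mean the file matches the source."""
    extra = con.execute(f"{WRITTEN} EXCEPT {DERIVED}").fetchall()
    missing = con.execute(f"{DERIVED} EXCEPT {WRITTEN}").fetchall()
    return sorted(extra), sorted(missing)

def make_fixture():
    """Tiny synthetic source and derived tables in an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE wide (pid TEXT, lat REAL, lng REAL);
        CREATE TABLE map_lite (pid TEXT, lat REAL, lng REAL);
        CREATE TABLE facets (pid TEXT);
        INSERT INTO wide VALUES ('p1', 10.0, 20.0), ('p2', 30.0, 40.0);
        INSERT INTO map_lite SELECT pid, lat, lng FROM wide;
        INSERT INTO facets SELECT pid FROM wide;
    """)
    return con
```

Zeroing every coordinate in `map_lite` leaves `internal_check` true, because the pid sets still agree, while `wide_diff` returns non-empty lists. That is the shape of the corruption test this commit describes.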
…wide_h3 correctness (Codex round 3)

- validator --wide now also diffs h3 resolution (exact) + center_lat/lng
  (tolerant 1e-4: catches gross corruption, ignores float/thread last-ULP
  jitter)
- facet_summaries.scheme contract checked (must be NULL)
- wide_h3 cell correctness test (cross-checked vs map_lite)
- tests prove h3 center/resolution corruption + scheme corruption are caught
  (12 total)

Verified: 12 passed; real --wide gate exits 0 on the 202604 rebuild with the
new checks; h3 center delta 1e-6 (well within 1e-4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 6, 2026
…ance + verify manifest (Codex/workflow round 4)

A proof workflow (independent re-exec + adversarial attack) found two real
misses:
1. H3 centroids: shifting every cell center ~9m (8e-5 deg) passed the loose
   1e-4 tolerance. Tightened to 1e-5 (~1m); residual undetected error now
   bounded at ~1m on display-only centroids. Re-running the exact attack now
   FAILS the gate.
2. manifest.json was never validated: corrupting its sha256 attestations
   passed. Validator now verifies every output file's sha256 (and the
   input's, with --wide) against the manifest. (Self-attesting, not signed;
   documented.)

Both attacks re-run against the fixed gate now exit 1. Clean real rebuild
still exits 0. 14 fixture tests (added regressions for both misses).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
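The tolerance arithmetic behind the first miss can be checked with a short sketch. The sample cell centers and the helper names `centers_match` and `shift` are hypothetical; the metres-per-degree constant is a common spherical approximation for latitude, not a value taken from the PR.

```python
# Assumption: roughly 111,320 m per degree of latitude (spherical approximation).
METERS_PER_DEGREE = 111_320.0

def centers_match(written, rederived, tol_deg):
    """True when every (lat, lng) pair agrees within tol_deg on both axes."""
    return len(written) == len(rederived) and all(
        abs(w_lat - r_lat) <= tol_deg and abs(w_lng - r_lng) <= tol_deg
        for (w_lat, w_lng), (r_lat, r_lng) in zip(written, rederived)
    )

def shift(centers, delta_deg):
    """Move every center by delta_deg on both axes."""
    return [(lat + delta_deg, lng + delta_deg) for lat, lng in centers]

truth = [(37.87, -122.27), (40.71, -74.0)]  # hypothetical cell centers
jitter = shift(truth, 1e-6)  # scale of the float/thread delta reported in round 3
attack = shift(truth, 8e-5)  # the round-4 attack: about 9 m per cell

print("attack passes 1e-4:", centers_match(attack, truth, 1e-4))
print("attack passes 1e-5:", centers_match(attack, truth, 1e-5))
print("jitter passes 1e-5:", centers_match(jitter, truth, 1e-5))
print("attack size in metres:", round(8e-5 * METERS_PER_DEGREE, 1))
```

Under this approximation an 8e-5 degree shift is about 8.9 m, which slips under 1e-4 degrees (about 11 m) but not under 1e-5 degrees (about 1.1 m), while the observed 1e-6 jitter still passes the tighter bound.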
Adversarial proof round (workflow: independent re-exec + attack + verdict)

Ran a workflow that independently re-executes the pipeline and attacks it. It found two real misses, both now fixed + regression-tested (re-running the exact attacks against the fixed gate now exits 1; the clean rebuild still exits 0):
Honest verdict (NOT "foolproof")

Proven (by execution, on the real 292MB wide): no root-material leak, sentinel preserved, pid uniqueness + set-equality, …

Documented boundaries (not bugs, but scope):
Use it with confidence for gross-error + build-vs-deploy-divergence detection; it is not a complete tamper/provenance-attestation system without signing + automated publish.
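The "self-attesting, not signed" boundary can be made concrete with a small sketch of manifest verification. The manifest layout (`{"tag": ..., "outputs": {filename: sha256}}`) and the functions `write_manifest` and `verify_manifest` are assumptions for illustration; the real {tag}_manifest.json also records argv, git SHA, and DuckDB/extension versions, which are omitted here.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(out_dir, tag):
    """Record a digest for every *.parquet file in out_dir (hypothetical layout)."""
    out_dir = Path(out_dir)
    outputs = {p.name: sha256_of(p) for p in sorted(out_dir.glob("*.parquet"))}
    manifest = out_dir / f"{tag}_manifest.json"
    manifest.write_text(json.dumps({"tag": tag, "outputs": outputs}, indent=2))
    return manifest

def verify_manifest(manifest_path):
    """Return the sorted names of outputs that are missing or whose digest changed."""
    manifest_path = Path(manifest_path)
    recorded = json.loads(manifest_path.read_text())["outputs"]
    bad = []
    for name, digest in recorded.items():
        target = manifest_path.parent / name
        if not target.exists() or sha256_of(target) != digest:
            bad.append(name)
    return sorted(bad)

def demo():
    """Build a throwaway directory, tamper with one file, and report what fails."""
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "a.parquet").write_bytes(b"aaa")  # placeholder bytes, not real parquet
        (Path(d) / "b.parquet").write_bytes(b"bbb")
        manifest = write_manifest(d, "demo")
        clean = verify_manifest(manifest)
        (Path(d) / "a.parquet").write_bytes(b"tampered")
        return clean, verify_manifest(manifest)
```

A check like this detects a file that changed after the build, but anyone who can rewrite both the file and the manifest still passes it, which is why the verdict above stops short of calling this tamper-proof without signing.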
Implements the rigorous, reproducible, AI-free-tested data pipeline tracked in #273.
What this delivers (all human-runnable: `make test` / `make all`)

- `scripts/build_frontend_derived.py` (hardened rewrite): geometry-agnostic (WKB BLOB or DuckDB GEOMETRY); decorrelated concept resolution; first-non-root material (#265/#271); deterministic COPY ordering + tie-broken `dominant_source`; hard-fails on duplicate pids / concept row-ids; emits `{tag}_manifest.json` (input+output sha256, argv, git SHA, DuckDB+extension versions).
- `scripts/validate_frontend_derived.py`: a semantic trust gate, not a spot check. It asserts the derived-file algebra (`facet_summaries == GROUP BY facets`, `facet_cross_filter == conditional GROUP BY`, `facets.pid == map_lite.pid`, pid uniqueness, H3 sums, scheme), and with `--wide` re-derives from the source wide and diffs the written facets/map_lite/h3 (cells, counts, dominant_source, resolution, tolerant centers). Non-zero exit on any failure.
- `tests/test_frontend_derived.py`: 12 fixture tests (no network/big data): both geometry encodings, material/concept/place_name cases, dup-pid hard-fail, manifest, wide_h3 correctness, and two tests that prove the gate catches corruption internal checks miss.
- `Makefile`, `scripts/requirements.txt` (`duckdb==1.4.4`), `.github/workflows/pipeline-tests.yml` (CI fixture gate).
- `DATA_PROVENANCE.md` + `SERIALIZATIONS.md` reconciled with reality.

Proven by execution (not by review)
- `--wide` semantic gate: exits 0 on the real 202604 rebuild; exits 1 when data is corrupted (a test zeroes coordinates, which passes internal checks and fails the gate).

Process

Scope + correctness hardened by three adversarial Codex rounds. Codex literally proved an earlier validator passed a wrecked rebuild; that hole (and the follow-ups: h3 resolution/centers, scheme, wide_h3) is closed and each fix has a test. AI sign-off was not the gate; the executable tests are.
Threat-model note: the `--wide` gate imports the same `build_base_tables`, so it catches corruption / staleness / wrong-version of published artifacts; builder-logic correctness is covered separately by the fixture tests (which assert against hand-written expected values).

Notes
- …`build_frontend_derived.py`); diff collapses once #264 (docs+scripts: data provenance map + build scripts for the 6 unscripted derived parquet) merges.
- …`current/manifest.json` cutover) is left for a human gate.

Closes #273 (pending the publish step, tracked there).
— 🤖 rbotyee (RY directing, out-of-office autonomous build). Codex adversarial ×3.