From b203c75d43dca7ebf343e6d7d3f59f302b70fc2c Mon Sep 17 00:00:00 2001
From: Raymond Yee <raymond.yee@gmail.com>
Date: Fri, 8 May 2026 07:42:27 -0700
Subject: [PATCH] docs: add SEARCH_INDEX_V1.md (closes #169)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The v1 search-substrate contract — sample-centric document projection
with concept.label as a v1 minimum, BM25 ranking with precomputed DF +
doc_len, hash-by-token partitioning with a 5 MB shard cap, query-time
stopword policy distinct from build-time tokenizer, hard quality gate,
and a build_stats.json artifact requirement that prevents the contract
from drifting from the builder.

Companion to EXPLORER_STATE.md (UI-surface contract, option C).
Backend changes are intentionally orthogonal to the UI contract; this
doc pins one half, the other doc pins the other.

Refs #165, #170, #171, #172, #174.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 SEARCH_INDEX_V1.md | 287 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)
 create mode 100644 SEARCH_INDEX_V1.md

diff --git a/SEARCH_INDEX_V1.md b/SEARCH_INDEX_V1.md
new file mode 100644
index 00000000..291d0ca8
--- /dev/null
+++ b/SEARCH_INDEX_V1.md
@@ -0,0 +1,287 @@
+# Search Index v1 Contract
+
+Doc-only contract for the iSamples Explorer's full-text search substrate.
+Closes [#169](https://github.com/isamplesorg/isamplesorg.github.io/issues/169);
+inputs are consumed by the offline builder ([#170](https://github.com/isamplesorg/isamplesorg.github.io/issues/170)),
+the browser query prototype ([#171](https://github.com/isamplesorg/isamplesorg.github.io/issues/171)),
+and the GO/NO-GO gate ([#172](https://github.com/isamplesorg/isamplesorg.github.io/issues/172)).
+
+Companion to [`EXPLORER_STATE.md`](./EXPLORER_STATE.md): that doc pins the
+*UI surface* contract (option C — side-panel + result-pin overlay); this
+doc pins the *search backend* contract. The two are intentionally
+orthogonal.
+
+> **v1 ≠ destination.** v1 is a learning lap, not the finish line.
+> The substrate format admits content-only expansion to v1.5 (sampling
+> event/site fields) and v2 (Solr `searchText` parity) without schema
+> migration. Section 1 names every expansion vector; budgets in §7
+> apply to v1 content; v2 expansion gets a measurement-driven re-budget,
+> not a renegotiation of v1 numbers.
+
+---
+
+## 1. Sample search document projection
+
+The substrate is **not** "tokenize these parquet columns." It is "tokenize
+a *sample-centric document* whose text fragments are joined across the
+property graph and tagged by their entity origin." Each sample (`pid`)
+has a logical document of weighted text fragments. Build-time joins
+materialize the projection; tokens are tagged with the **virtual field
+name** (entity dot field), not the source parquet column.
+
+### v1 minimum (the projection that ships first)
+
+| virtual field         | source                                                              | rationale                                                  |
+|-----------------------|---------------------------------------------------------------------|------------------------------------------------------------|
+| `sample.label`        | `MaterialSampleRecord.label` (~6.68M coverage)                      | canonical title; near-universal                            |
+| `sample.description`  | `MaterialSampleRecord.description` (~1.61M ≈ 24%)                   | sparse but high-signal where present                       |
+| `sample.place_name`   | `samples_map_lite.parquet.place_name[]` (~2.21M)                    | already proven valuable in current ILIKE search            |
+| `concept.label`       | `material` / `context` / `object_type` URIs dereferenced via `vocab_labels.parquet` (`pref_label`, `lang=en`) | **load-bearing**: facet URIs are near-universal but raw URIs are useless to FTS; dereferenced labels make `pottery`, `ceramic`, `basalt`, `bone`, `marine` work as the user expects |
+
+A sample whose `material` URI is `<…>/Pottery` gets a row
+`{token: 'pottery', pid: …, field: 'concept.label', tf: 1, …}`. One row
+per sample per facet URI per token after tokenization.
+
+### v1.5 expansion (additive — no schema change)
+
+| virtual field                       | source (Solr equivalent)                                  |
+|-------------------------------------|-----------------------------------------------------------|
+| `event.label`                       | `producedBy_label` (~1.92M)                               |
+| `event.description`                 | `producedBy_description` (~5.54M ≈ 83%)                   |
+| `event.has_feature_of_interest`     | `producedBy_hasFeatureOfInterest` (~6.35M ≈ 95%)          |
+| `event.sampling_purpose`            | `producedBy_samplingPurpose` (~262K)                      |
+| `site.label`                        | `producedBy_samplingSite_label` (~190K)                   |
+| `site.description`                  | `producedBy_samplingSite_description` (~172K)             |
+| `site.place_name`                   | `producedBy_samplingSite_placeName[]` (~336K)             |
+
+### v2 / Solr `searchText` parity (named, not built)
+
+| virtual field            | source                                                |
+|--------------------------|-------------------------------------------------------|
+| `agent.name`             | registrant + responsibility agents                    |
+| `curation.label`         | `curation_label`                                      |
+| `curation.description`   | `curation_description`                                |
+| `curation.location`      | `curation_location`                                   |
+| `keywords`               | (if present)                                          |
+| `source`                 | `source` enum (low value as FTS — facet UI suffices)  |
+
+---
+
+## 2. Tokenizer (build-time)
+
+- Lowercase ASCII via `String.prototype.toLowerCase()` / Python `str.lower()`.
+- Unicode NFKC normalization.
+- Diacritic stripping via NFD + combining-mark removal.
+- Whitespace split, punctuation stripped, length filter (`1 ≤ len ≤ 64`).
+- **No stemming.** Honest limitation; document in UI copy.
+- **Index every token, including stopwords.** Stopword handling is
+  query-time, not build-time (§3) — keeps substrate flexible for future
+  phrase queries.
+- **Parallel implementations**: JS for browser query, Python for offline
+  build. Shared regression test set (≥ 30 strings, including diacritics,
+  mixed case, hyphenation, IGSN ids, archaeological place names).
+  CI fails on Python ↔ JS divergence.
+
+---
+
+## 3. Query-time policy (distinct from build-time)
+
+A separate axis from §2 — same tokenizer, but with additional query-only
+filtering applied to the user input *before* combining tokens.
+
+- **Tokenize the user input** with the same tokenizer used at build
+  (lowercase + NFKC + diacritic strip + whitespace split + length
+  filter). Round-trip invariant must hold.
+- **Drop English stopwords** from the bag-of-words AND. Curated list:
+  `a`, `an`, `the`, `of`, `from`, `for`, `to`, `in`, `on`, `at`, `is`,
+  `was`, `with`, `and`, `or`. Small, conservative, no language detection.
+  Rationale: a query like `pottery from Cyprus` should not require `from`
+  to match. Build-time skipping would lose phrase-query potential;
+  query-time skipping is reversible policy.
+- **AND-combine the surviving tokens.** Empty surviving set ⇒ controlled
+  empty state with helpful copy ("All terms in your query are common
+  words. Add a more specific term to search.") — never a full-corpus
+  dump or an error.
+- **No query-language syntax in v1.** No quoted phrases, no
+  field-prefix operators, no booleans, no negation, no wildcards, no
+  fuzzy. Documented v2 path: phrase quoting first (cheap and
+  high-value), then field-prefix and negation.
+
+---
+
+## 4. Substrate row schema
+
+Two parquet outputs per data version.
+
+**Token-row substrate** (the partitioned inverted index):
+
+```
+{
+  token:      VARCHAR    -- normalized token
+  pid:        VARCHAR    -- sample primary id
+  field:      VARCHAR    -- virtual field: 'sample.label' | 'sample.description'
+                         --   | 'sample.place_name' | 'concept.label'
+                         --   | (future: 'event.*' | 'site.*' | 'agent.*' …)
+  tf:         USMALLINT  -- term frequency in this (pid, field) pair
+  doc_len:    USMALLINT  -- token count of (pid, field) for BM25 length norm
+}
+```
+
+**Sidecar `df.parquet`** (global per-token document frequency):
+
+```
+{
+  token:  VARCHAR
+  df:     UINTEGER  -- documents (pid × field) containing this token
+}
+```
+
+Field weights are **query-side code**, not substrate data. Adding a
+v1.5 / v2 field is a build-pipeline scope change, not a schema migration.
+
+---
+
+## 5. Ranking
+
+BM25, fixed `k1 = 1.2`, `b = 0.75`. DF and `doc_len` from the substrate.
+Field weights live in query code:
+
+| field                | weight |
+|----------------------|--------|
+| `sample.label`       | 3.0    |
+| `concept.label`      | 2.5    |
+| `sample.place_name`  | 2.0    |
+| `sample.description` | 1.0    |
+
+Final result rank per `pid` = sum across `(pid, field)` BM25 contributions
+weighted by field weight. Top-K = 50 (matches the result-pin overlay cap
+in `EXPLORER_STATE.md` §6).
+
+---
+
+## 6. Partition shape
+
+- Hash-partition by token: `hash(token) % N` shards.
+- Per-shard byte cap: **≤ 5 MB** uncompressed parquet.
+- High-frequency token rule: if a single token's postings would exceed
+  the cap, sub-shard by `hash(pid) % M` within that token's logical
+  shard.
+- Number of top-level shards (`N`): start with **64**; refine in build
+  measurement.
+
+---
+
+## 7. Budgets
+
+These are **contract**. The GO/NO-GO gate (#172) is mechanical against
+this table.
+
+| metric                                          | target           |
+|-------------------------------------------------|------------------|
+| cold first search (P50)                         | ≤ 2 s            |
+| warm repeat-same-query search                   | ≤ 500 ms         |
+| warm new-query-after-warm-up search             | ≤ 500 ms         |
+| filter-composed cold search                     | ≤ 3 s            |
+| bytes transferred cold                          | ≤ 5 MB           |
+| bytes transferred warm                          | ≤ 1 MB per query |
+
+**"Warm" disambiguation** (resolves [#174](https://github.com/isamplesorg/isamplesorg.github.io/issues/174)):
+
+- *Warm-repeat-same-query*: same query, second invocation, same page.
+  Measures end-to-end cache + render path.
+- *Warm-new-query-after-warm-up*: different query, after parquet
+  metadata is already cached. Measures query execution after the
+  substrate file is warm.
+
+Both are reported by the benchmark; the budget targets above apply to both.
+
+Numeric thresholds for quality gates (top-K overlap percentages) are
+calibrated **after** [#167](https://github.com/isamplesorg/isamplesorg.github.io/issues/167) baseline lands and the prototype runs;
+they do not appear in this doc as fixed numbers. The contract requires
+the *shape* of the gate (top-3 vs hand-labeled, top-10 vs hand-labeled,
+top-10 vs DuckDB FTS oracle, plus the hard-fail invariants in §9), not
+the specific percentages.
+
+---
+
+## 8. Versioning
+
+- URL pattern: `https://data.isamples.org/isamples_YYYYMM_search_index_v1/<shard>.parquet`
+  with sidecar `df.parquet` and `build_stats.json` (§10).
+- Explorer pins to a specific `YYYYMM` so a dataset rebuild can't
+  break a deployed site mid-flight.
+- Index version is tied to data version. v1.x format bumps require
+  URL-path bump (`_v1` → `_v2`).
+
+---
+
+## 9. Curated benchmark + quality gate
+
+- File: `tests/search_benchmark.json`.
+- 12-15 queries, hand-labeled top-10 by Raymond. Must include:
+  - **bare-text queries** (`pottery`, `basalt`)
+  - **multi-term** (`pottery Cyprus`)
+  - **stopword-heavy** (`pottery from Cyprus`) — verifies §3 query-time
+    stopword policy
+  - **concept-only queries** (`ceramic`, `bone`, `mammal`) — verifies
+    dereferenced concept labels work; **fails loudly** if v1 ships
+    without concept labels
+  - **diacritic** (`Çatalhöyük`)
+  - **no-hit** (`xyzzyqqqplugh`)
+  - **filter-composed cases** (source-only, source + material) with
+    hand-labeled expected filtered top-K (per #172 hard-fail
+    invariant)
+- The gate is hard, not advisory:
+  - top-3 overlap vs hand-labeled set ≥ TBD%
+  - top-10 overlap vs hand-labeled set ≥ TBD%
+  - top-10 overlap vs DuckDB FTS local oracle (#171 §5) ≥ TBD%
+  - **zero concept-only benchmark queries return empty**
+  - **zero stopword-heavy queries return empty**
+  - **all hard-fail invariants in [#172](https://github.com/isamplesorg/isamplesorg.github.io/issues/172) pass**
+
+---
+
+## 10. Build-stats artifact (contract requirement)
+
+The v1 substrate build pipeline ([#170](https://github.com/isamplesorg/isamplesorg.github.io/issues/170)) MUST emit
+`build_stats.json` alongside the partitioned token-row parquets:
+
+```json
+{
+  "data_version": "isamples_YYYYMM",
+  "built_at_utc": "YYYY-MM-DDTHH:MM:SSZ",
+  "total_samples": <int>,
+  "fields": {
+    "sample.label":       { "samples_with_field": <int>, "total_tokens": <int>, "avg_doc_len": <float> },
+    "sample.description": { ... },
+    "sample.place_name":  { ... },
+    "concept.label":      { ... }
+  },
+  "concept_label_uri_resolution": {
+    "material_resolved": <fraction>, "material_missing_pref": <fraction>,
+    "context_resolved":  <fraction>, "context_missing_pref":  <fraction>,
+    "object_type_resolved": <fraction>, "object_type_missing_pref": <fraction>
+  },
+  "shard_count": <int>,
+  "shard_max_size_mb": <float>,
+  "total_bytes_uncompressed": <int>,
+  "build_seconds": <float>,
+  "top_df_tokens": [ ["the", <int>], ["of", <int>], ... ]
+}
+```
+
+This contract item exists so this doc and the builder cannot drift —
+every release of the substrate carries empirical coverage data, not
+narrative claims.
+
+---
+
+## 11. Out of scope (v1)
+
+- Solr-parity field set (§1 v2 expansion path; not implemented).
+- Stemming (English-specific, hurts non-English content; v2+ if at all).
+- Query-language syntax: quoted phrases, field operators, booleans,
+  negation, wildcards, fuzzy matching, ranges, boosts.
+- Hosted-search backend (a permanent contingency triggered by either
+  GO/NO-GO failure in #172 OR v2+ requirements that exceed what a
+  static substrate can deliver — see [#171 §7](https://github.com/isamplesorg/isamplesorg.github.io/issues/171)).