Explorer FTS Track 2: search_index_v1 contract doc #169

> **Updated 2026-05-08** per Codex review on #165. Major changes: refactored around a sample-centric document projection (Section 1), v1 minimum now includes dereferenced concept labels, query-time policy split from build-time tokenizer (new Section 3), quality gate hardened (Section 9). Original framing preserved in git history.

Sub-issue of #165. No code dependencies; can start in parallel with #N1 (#167).

## Goal

Land a single design doc — `SEARCH_INDEX_V1.md` — that pins the v1 substrate contract before any pipeline or query code is written. Tracks 3-5 implement and measure against this doc.

## Required contents

### 1. Sample search document projection

The substrate is **not** "tokenize these parquet columns." It is "tokenize a *sample-centric document* whose text fragments are joined across the property graph and tagged by their entity origin."

Each sample (`pid`) has a logical document of weighted text fragments. At build time, a join across the wide parquet (or its narrow equivalents) produces this projection; the substrate then tokenizes per fragment, tagging each token row with the **virtual field name** (entity dot field), not the source parquet column.

**v1 minimum** — the projection that ships first:

| virtual field         | source                                                              | rationale                                                  |
|-----------------------|----------------------------------------------------------------------|------------------------------------------------------------|
| `sample.label`        | `MaterialSampleRecord.label` (~6.68M coverage)                       | canonical title; near-universal                            |
| `sample.description`  | `MaterialSampleRecord.description` (~1.61M ≈ 24%)                    | sparse but high-signal where present                       |
| `sample.place_name`   | `samples_map_lite.parquet.place_name[]` (~2.21M)                     | already proven valuable in current ILIKE search            |
| `concept.label`       | `material` / `context` / `object_type` URIs dereferenced via `vocab_labels.parquet` (`pref_label`, lang=en) | **load-bearing addition**: facet URIs are near-universal but raw URIs are useless to FTS; dereferenced labels make `pottery`, `ceramic`, `basalt`, `bone`, `marine` work as the user expects |

A sample whose `material` URI is `<…>/Pottery` gets a row `{token: 'pottery', pid: ..., field: 'concept.label', tf: 1, ...}`. One row per sample per facet URI per token. Coverage: most of the dataset.
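The dereferencing step can be sketched as follows. This is a minimal illustration, not the builder: the `sample` dict shape, the `vocab_labels` mapping (URI to English `pref_label`, standing in for `vocab_labels.parquet`), the multi-valued facets, and the `example.org` URIs are all assumptions made for the example.

```python
FACETS = ("material", "context", "object_type")


def concept_fragments(sample, vocab_labels):
    """Return ('concept.label', text) fragments for one sample.

    `sample` maps each facet name to a list of URIs (assumed shape).
    `vocab_labels` maps URI -> English pref_label (assumed shape).
    URIs with no label are skipped rather than indexed as raw URIs.
    """
    fragments = []
    for facet in FACETS:
        for uri in sample.get(facet, []):
            label = vocab_labels.get(uri)
            if label is not None:
                fragments.append(("concept.label", label))
    return fragments


# Hypothetical data for illustration only.
vocab = {"https://example.org/vocab/material/Pottery": "Pottery"}
sample = {
    "pid": "p1",
    "material": ["https://example.org/vocab/material/Pottery"],
    "context": ["https://example.org/vocab/context/Unknown"],
}
fragments = concept_fragments(sample, vocab)
```

Skipped URIs are the ones that would count against the concept-label URI resolution rate recorded in the build stats (§10).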

**v1.5 expansion** (post-v1, additive — no schema change):

| virtual field                       | source (Solr equivalent)                                  |
|-------------------------------------|-----------------------------------------------------------|
| `event.label`                       | `producedBy_label` (~1.92M)                               |
| `event.description`                 | `producedBy_description` (~5.54M ≈ 83%)                   |
| `event.has_feature_of_interest`     | `producedBy_hasFeatureOfInterest` (~6.35M ≈ 95%)          |
| `event.sampling_purpose`            | `producedBy_samplingPurpose` (~262K)                      |
| `site.label`                        | `producedBy_samplingSite_label` (~190K)                   |
| `site.description`                  | `producedBy_samplingSite_description` (~172K)             |
| `site.place_name`                   | `producedBy_samplingSite_placeName[]` (~336K rows)        |

**v2 / Solr `searchText` parity** (named, not built):

| virtual field            | source                                                |
|--------------------------|-------------------------------------------------------|
| `agent.name`             | registrant + responsibility agents                    |
| `curation.label`         | `curation_label`                                      |
| `curation.description`   | `curation_description`                                |
| `curation.location`      | `curation_location`                                   |
| `keywords`               | (if present)                                          |
| `source`                 | `source` (already a facet; low-value as FTS)          |

### 2. Tokenizer (build-time)

- Lowercase via `String.prototype.toLowerCase()` / Python `str.lower()` (both apply Unicode case mapping, not ASCII-only).
- Unicode NFKC normalization.
- Diacritic stripping via NFD + combining-mark removal.
- Whitespace split, punctuation stripped, length filter (`1 ≤ len ≤ 64`).
- **No stemming.** Honest limitation; document in UI copy.
- **Index every token, including stopwords.** Stopword handling is query-time, not build-time (see §3) — keeps substrate flexible for future phrase queries.
- **Parallel implementations**: JS for browser query, Python for offline build. Shared regression test set (≥30 strings).
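A minimal Python sketch of the steps above, applied in the order listed. Two details are assumptions rather than part of the contract: "punctuation" is taken to mean any character outside `\w` (so underscores survive), and punctuation is removed from each token after the whitespace split rather than treated as a separator.

```python
import re
import unicodedata

MIN_LEN, MAX_LEN = 1, 64
# Assumption: punctuation = any non-word character.
_PUNCT = re.compile(r"[^\w]")


def tokenize(text):
    """Lowercase, NFKC, strip diacritics, split, strip punctuation, filter."""
    text = text.lower()
    text = unicodedata.normalize("NFKC", text)
    # NFD separates base letters from combining marks so marks can be dropped.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = []
    for raw in text.split():
        token = _PUNCT.sub("", raw)
        if MIN_LEN <= len(token) <= MAX_LEN:
            tokens.append(token)
    return tokens
```

One parity trap for the shared regression set: Python's `len` counts code points while JS `String.length` counts UTF-16 code units, so the length filter can disagree on tokens containing characters outside the Basic Multilingual Plane.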

### 3. Query-time policy (distinct from build-time)

A separate axis from the build-time tokenizer.

- **Tokenize the user input** with the same tokenizer used at build (lowercase + NFKC + diacritic strip + whitespace split + punctuation strip + length filter). Keeps the round-trip invariant.
- **Drop or downweight English stopwords** from the bag-of-words AND. Curated list (`a`, `an`, `the`, `of`, `from`, `for`, `to`, `in`, `on`, `at`, `is`, `was`, `with`, `and`, `or`) — small, conservative, no language detection.
  - Rationale: a query like `pottery from Cyprus` should not fail because no sample has `from` in its text. Build-time skipping would lose phrase-query potential; query-time is reversible policy.
- **AND-combine the surviving tokens.** Empty surviving set ⇒ empty result with helpful copy.
- **No query-language syntax in v1.** No quoted phrases, no field-prefix operators (`label:foo`), no booleans, no negation. Documented v2 path: phrase quoting first; field-prefix and negation later. (Reference: `query-spec.qmd` Solr surface — explicitly *not* implemented in v1.)
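The policy above can be sketched in a few lines. This version takes the "drop" branch of "drop or downweight", expects input already run through the build-time tokenizer, and uses an in-memory `postings` dict (token to set of pids) as a hypothetical stand-in for the fetched shards.

```python
STOPWORDS = frozenset({
    "a", "an", "the", "of", "from", "for", "to", "in", "on", "at",
    "is", "was", "with", "and", "or",
})


def surviving_tokens(query_tokens):
    """Drop stopwords and duplicates, preserving first-seen order."""
    seen = set()
    out = []
    for token in query_tokens:
        if token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        out.append(token)
    return out


def and_match(query_tokens, postings):
    """Return pids containing every surviving token (bag-of-words AND)."""
    terms = surviving_tokens(query_tokens)
    if not terms:
        return set()  # empty surviving set => empty result
    result = set(postings.get(terms[0], set()))
    for token in terms[1:]:
        result &= postings.get(token, set())
    return result


# Hypothetical postings for illustration only.
postings = {"pottery": {"p1", "p2"}, "cyprus": {"p2", "p3"}}
hits = and_match(["pottery", "from", "cyprus"], postings)
```

Because `from` never reaches the AND, `pottery from Cyprus` behaves like `pottery Cyprus`, while the indexed `from` rows remain available for a later phrase-query path.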

### 4. Substrate row schema

```
{
  token:      VARCHAR  -- normalized token
  pid:        VARCHAR  -- sample primary id
  field:      VARCHAR  -- 'sample.label' | 'sample.description' | 'sample.place_name'
                       -- | 'concept.label' | (future: 'event.*' | 'site.*' | 'agent.*' …)
  tf:         USMALLINT -- term frequency in this (pid, field) pair
  doc_len:    USMALLINT -- token count of (pid, field) for BM25 length norm
}
```

Field weights are **query-side code**, not substrate data. Adding a v1.5 / v2 field = re-running the build pipeline with more sources, no schema migration.
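As an illustration of how rows with this schema could be produced, the sketch below aggregates per `(pid, field)` pair, following the `tf` and `doc_len` definitions in the schema. The function name, the `fragments` shape, and the injected `tokenize` callable are assumptions for the example; range handling for `USMALLINT` is left out.

```python
from collections import Counter, defaultdict


def build_rows(pid, fragments, tokenize):
    """Turn (field, text) fragments for one sample into substrate rows.

    All fragments sharing a virtual field are pooled, so tf and doc_len
    are both measured over the whole (pid, field) pair.
    """
    tokens_by_field = defaultdict(list)
    for field, text in fragments:
        tokens_by_field[field].extend(tokenize(text))

    rows = []
    for field, tokens in tokens_by_field.items():
        doc_len = len(tokens)
        for token, tf in sorted(Counter(tokens).items()):
            rows.append({
                "token": token,
                "pid": pid,
                "field": field,
                "tf": tf,
                "doc_len": doc_len,
            })
    return rows


# Stand-in tokenizer for the example; the real one is specified in §2.
rows = build_rows(
    "p1",
    [
        ("sample.label", "Pottery sherd"),
        ("concept.label", "Pottery"),
        ("concept.label", "Ceramic pottery"),
    ],
    lambda text: text.lower().split(),
)
```

Adding a v1.5 field in this shape only means passing more `(field, text)` fragments; the row layout does not change.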

### 5. Ranking spec

- BM25, fixed `k1=1.2`, `b=0.75` (tune in v1.1 only if benchmark drift demands it).
- DF (per-token document frequency) precomputed at build time, stored alongside the substrate.
- Length norm uses `doc_len` from the schema above.
- Field weights (query-side, v1):

| field                | weight |
|----------------------|--------|
| `sample.label`       | 3.0    |
| `concept.label`      | 2.5    |
| `sample.place_name`  | 2.0    |
| `sample.description` | 1.0    |

Final result rank = sum across (pid, field) BM25 contributions weighted by field weight.
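A sketch of the scoring sum, using the constants and weights above. Several pieces are assumptions because this section does not pin them down: the IDF variant (the common `ln(1 + (N - df + 0.5) / (df + 0.5))` form), average length tracked per field, and the `rows` / `df` / `avg_len` argument shapes.

```python
import math

K1, B = 1.2, 0.75
FIELD_WEIGHTS = {
    "sample.label": 3.0,
    "concept.label": 2.5,
    "sample.place_name": 2.0,
    "sample.description": 1.0,
}


def idf(df, n_docs):
    # Assumed IDF variant; always non-negative.
    return math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_term(tf, doc_len, avg_len, df, n_docs):
    """BM25 contribution of one token in one (pid, field) pair."""
    norm = 1.0 - B + B * (doc_len / avg_len)
    return idf(df, n_docs) * (tf * (K1 + 1.0)) / (tf + K1 * norm)


def score(rows, df, n_docs, avg_len):
    """Sum field-weighted BM25 contributions per pid.

    rows:    substrate rows matching the query tokens
    df:      token -> document frequency (precomputed at build time)
    avg_len: field -> average doc_len (assumed to be per field)
    """
    scores = {}
    for row in rows:
        weight = FIELD_WEIGHTS.get(row["field"], 0.0)
        term = bm25_term(row["tf"], row["doc_len"], avg_len[row["field"]],
                         df[row["token"]], n_docs)
        scores[row["pid"]] = scores.get(row["pid"], 0.0) + weight * term
    return scores


# Hypothetical rows: same token, same tf and length, different fields.
rows = [
    {"token": "pottery", "pid": "p1", "field": "sample.label", "tf": 1, "doc_len": 2},
    {"token": "pottery", "pid": "p2", "field": "sample.description", "tf": 1, "doc_len": 2},
]
scores = score(rows, {"pottery": 2}, 10,
               {"sample.label": 2.0, "sample.description": 2.0})
```

With identical `tf` and length, the label hit outranks the description hit by exactly the 3.0 : 1.0 weight ratio, which is the behavior the weight table is meant to encode.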

### 6. Partition shape

- Hash-partition by token: `hash(token) % N` shards.
- Per-shard byte cap: **≤ 5 MB** uncompressed parquet.
- High-frequency token rule: if a single token's postings would exceed the cap, sub-shard by `hash(pid) % M` within that token's logical shard.
- Number of top-level shards (`N`): start with 64, refine in build measurement.
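The shard assignment can be sketched as below. The hash function is an assumption: the contract does not name one, and CRC32 over UTF-8 bytes is used here only because it is stable across processes and easy to reproduce in the browser (Python's built-in `hash()` is salted per process for strings, so it is unsuitable). The `hot_tokens` map (token to `M`) is a hypothetical way to record which tokens were sub-sharded.

```python
import zlib

N_SHARDS = 64  # starting value from the contract; refined by build measurement


def stable_hash(value):
    """Deterministic 32-bit hash of a string (assumed choice: CRC32)."""
    return zlib.crc32(value.encode("utf-8"))


def shard_for(token, n_shards=N_SHARDS):
    return stable_hash(token) % n_shards


def shard_key(token, pid, hot_tokens):
    """Return (top_level_shard, sub_shard or None) for one posting.

    hot_tokens maps a high-frequency token to its sub-shard count M.
    """
    top = shard_for(token)
    m = hot_tokens.get(token)
    if m is None:
        return (top, None)
    return (top, stable_hash(pid) % m)


key_plain = shard_key("basalt", "p1", {})
key_hot = shard_key("pottery", "p1", {"pottery": 8})
```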

### 7. Budgets

| metric                       | target           | rationale                                |
|------------------------------|------------------|------------------------------------------|
| cold first search (P50)      | ≤ 2 s            | matches user expectation for "search"    |
| warm repeat search           | ≤ 500 ms         | substantial improvement over ILIKE       |
| filter-composed cold search  | ≤ 3 s            | accommodates source + facet AND          |
| bytes transferred cold       | ≤ 5 MB           | acceptable on residential broadband      |
| bytes transferred warm       | ≤ 1 MB per query | repeated queries don't refetch shards    |

These are **contract**. Track 5's GO/NO-GO gate is mechanical against this table.

**"Warm" disambiguation** (per #174 deferred): the contract distinguishes two warm cases:
- `re-run-same-query warm`: same query, second invocation, same page (measures end-to-end cache + render path)
- `new-query-after-warm-up warm`: different query, after parquet metadata is cached (measures query execution after substrate file is warm)

Both are reported by the benchmark; the budget targets above apply to both.
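A mechanical gate over the budget table might look like the sketch below. The metric key names are hypothetical, and "MB" is assumed to mean 10^6 bytes; the contract does not say whether decimal or binary megabytes are intended.

```python
# Budget ceilings from the table above (key names are illustrative).
BUDGETS = {
    "cold_first_search_p50_s": 2.0,
    "warm_repeat_search_s": 0.5,
    "filter_composed_cold_search_s": 3.0,
    "bytes_cold": 5_000_000,            # assumes 1 MB = 10**6 bytes
    "bytes_warm_per_query": 1_000_000,  # assumes 1 MB = 10**6 bytes
}


def check_budgets(measured):
    """Return the sorted list of metrics that miss budget.

    A metric absent from `measured` counts as a failure, so a benchmark
    run cannot pass by omitting a measurement. Run once per warm variant.
    """
    failures = []
    for metric, ceiling in BUDGETS.items():
        value = measured.get(metric)
        if value is None or value > ceiling:
            failures.append(metric)
    return sorted(failures)


# Hypothetical measurements for illustration only.
run = {
    "cold_first_search_p50_s": 1.7,
    "warm_repeat_search_s": 0.3,
    "filter_composed_cold_search_s": 2.9,
    "bytes_cold": 4_200_000,
    "bytes_warm_per_query": 150_000,
}
go = check_budgets(run) == []
```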

### 8. Versioning

- URL pattern: `https://data.isamples.org/isamples_YYYYMM_search_index_v1/<shard>.parquet`.
- Explorer pins to a specific `YYYYMM` so a dataset rebuild can't break a deployed site mid-flight.
- Index version tied to data version. v1.x format bumps require URL path bump (`_v1` → `_v2`).
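For illustration, a pinned shard URL could be assembled as follows. The shard file naming (`0007` here) and the `202605` snapshot are placeholders invented for the example; only the overall pattern comes from the contract.

```python
import re

BASE_URL = "https://data.isamples.org"
_YYYYMM = re.compile(r"\d{4}(0[1-9]|1[0-2])")


def shard_url(yyyymm, shard, index_version=1):
    """Build the URL for one shard of a pinned dataset snapshot."""
    if not _YYYYMM.fullmatch(yyyymm):
        raise ValueError(f"expected YYYYMM, got {yyyymm!r}")
    return (f"{BASE_URL}/isamples_{yyyymm}_search_index_v{index_version}"
            f"/{shard}.parquet")


url = shard_url("202605", "0007")
```

Keeping `yyyymm` and `index_version` as explicit arguments makes the pin visible at the call site, so a rebuild or format bump is a deliberate one-line change in the Explorer.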

### 9. Curated benchmark + quality gate

- File: `tests/search_benchmark.json`
- 12-15 queries, hand-labeled top-10 by Raymond. Must include:
  - **bare-text queries** (`pottery`, `basalt`)
  - **multi-term** (`pottery Cyprus`)
  - **stopword-heavy** (`pottery from Cyprus`) — verifies query-time stopword policy works
  - **concept-only queries** (`ceramic`, `bone`, `mammal`) — verifies dereferenced concept labels work; **fails loudly** if v1 ships without concept labels
  - **diacritic** (`Çatalhöyük`)
  - **no-hit** (`xyzzyqqqplugh`)
  - **filter-composed cases** (source-only, source + material)
- **Quality gate is a hard requirement**, not advisory. Each release of the substrate must hit:
  - top-3 result-set overlap with hand-labeled set: ≥ TBD% (calibrate after #167 baseline + #171 prototype)
  - top-10 result-set overlap with hand-labeled set: ≥ TBD%
  - **zero "concept-only" benchmark queries return empty results**
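The gate checks can be sketched as below. The overlap definition is an assumption (fraction of the hand-labeled top-k that also appears in the returned top-k, with the labeled list treated as ordered), and the thresholds stay as caller-supplied values because the contract leaves them TBD.

```python
def overlap_at_k(returned, labeled, k):
    """Fraction of the labeled top-k pids present in the returned top-k.

    For a no-hit query (empty labeled set), returning nothing scores 1.0
    and returning anything scores 0.0.
    """
    top = set(returned[:k])
    gold = set(labeled[:k])
    if not gold:
        return 1.0 if not top else 0.0
    return len(top & gold) / len(gold)


def empty_concept_queries(results_by_query, concept_queries):
    """Return the concept-only queries that came back empty (must be [])."""
    return [q for q in concept_queries if not results_by_query.get(q)]


# Hypothetical benchmark data for illustration only.
score_top3 = overlap_at_k(["a", "b", "c", "d"], ["a", "c", "x"], 3)
failing = empty_concept_queries(
    {"ceramic": ["p1"], "bone": []},
    ["ceramic", "bone", "mammal"],
)
```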

### 10. Build-stats artifact (contract requirement)

The v1 substrate build pipeline (#170) MUST emit `build_stats.json` alongside the partitioned token-row parquets, recording per-virtual-field populated-sample-count, total-token-count, average doc length, concept-label URI resolution rate, top-DF tokens, and shard size distribution. Schema and acceptance thresholds are specified in #170 §6.

This contract item exists so `SEARCH_INDEX_V1.md` and the builder cannot drift: every release of the substrate carries empirical coverage data, not a doc claim about coverage.
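As a rough illustration of the per-field numbers listed above, three of them can be derived directly from substrate rows. The output key names here are hypothetical; the authoritative `build_stats.json` schema is the one in #170 §6.

```python
from collections import defaultdict


def field_stats(rows):
    """Per-virtual-field coverage stats computed from substrate rows.

    Summing tf over a (pid, field) pair equals its doc_len, so the total
    token count per field is just the sum of tf across that field's rows.
    """
    pids = defaultdict(set)
    tokens = defaultdict(int)
    for row in rows:
        pids[row["field"]].add(row["pid"])
        tokens[row["field"]] += row["tf"]
    return {
        field: {
            "populated_sample_count": len(pids[field]),
            "total_token_count": tokens[field],
            "avg_doc_len": tokens[field] / len(pids[field]),
        }
        for field in pids
    }


# Hypothetical rows for illustration only.
stats = field_stats([
    {"token": "pottery", "pid": "p1", "field": "sample.label", "tf": 1, "doc_len": 2},
    {"token": "sherd", "pid": "p1", "field": "sample.label", "tf": 1, "doc_len": 2},
    {"token": "basalt", "pid": "p2", "field": "sample.label", "tf": 1, "doc_len": 1},
    {"token": "pottery", "pid": "p1", "field": "concept.label", "tf": 2, "doc_len": 2},
])
```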

### 11. Out of scope (v1)

- Solr-parity field set (named in §1 as the v2 expansion path; not implemented).
- Stemming (English-specific, hurts non-English content; v2+ if at all).
- Query-language syntax: quoted phrases, field operators, booleans, negation, wildcards, fuzzy matching, ranges, boosts — all v2+.

## Acceptance

- [ ] `SEARCH_INDEX_V1.md` lands in repo root (or `docs/` — match the EXPLORER_STATE.md placement)
- [ ] All 11 sections above populated
- [ ] `tests/search_benchmark.json` lands with hand-labeled top-10 for the canonical query set, including the concept-only and stopword-heavy queries
- [ ] Build-stats artifact requirement (§10) referenced in #170 acceptance
- [ ] Doc-only PR; no pipeline or browser code

## Refs

#165, #164, PR #95, #170 (build-stats), #174 (warm disambiguation)


