Explorer FTS Track 2: search_index_v1 contract doc #169

@rdhyee

Updated 2026-05-08 per Codex review on #165. Major changes: refactored around a sample-centric document projection (Section 1), v1 minimum now includes dereferenced concept labels, query-time policy split from build-time tokenizer (new Section 3), quality gate hardened (Section 9). Original framing preserved in git history.

Sub-issue of #165. No code dependencies; can start in parallel with #N1 (#167).

Goal

Land a single design doc — SEARCH_INDEX_V1.md — that pins the v1 substrate contract before any pipeline or query code is written. Tracks 3-5 implement and measure against this doc.

Required contents

1. Sample search document projection

The substrate is not "tokenize these parquet columns." It is "tokenize a sample-centric document whose text fragments are joined across the property graph and tagged by their entity origin."

Each sample (pid) has a logical document of weighted text fragments. At build time, a join across the wide parquet (or its narrow equivalents) produces this projection; the substrate then tokenizes each fragment and tags every token row with the virtual field name (entity.field, e.g. sample.label), not the source parquet column.

v1 minimum — the projection that ships first:

| virtual field | source | rationale |
| --- | --- | --- |
| sample.label | MaterialSampleRecord.label (~6.68M coverage) | canonical title; near-universal |
| sample.description | MaterialSampleRecord.description (~1.61M ≈ 24%) | sparse but high-signal where present |
| sample.place_name | samples_map_lite.parquet.place_name[] (~2.21M) | already proven valuable in current ILIKE search |
| concept.label | material / context / object_type URIs dereferenced via vocab_labels.parquet (pref_label, lang=en) | load-bearing addition: facet URIs are near-universal but raw URIs are useless to FTS; dereferenced labels make pottery, ceramic, basalt, bone, marine work as the user expects |

A sample whose material URI is <…>/Pottery gets a row {token: 'pottery', pid: ..., field: 'concept.label', tf: 1, ...}. One row per sample per facet URI per token. Coverage: most of the dataset.
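
The projection step can be sketched roughly as follows. This is a minimal illustration, not the builder's actual code: the record keys (label, description, place_name, material, context, object_type), the example URI, and the in-memory vocab_labels dict are all assumptions standing in for the real parquet join.

```python
# Hypothetical record keys; the real parquet column names may differ.
FACET_KEYS = ("material", "context", "object_type")


def project_sample(sample: dict, vocab_labels: dict[str, str]) -> list[tuple[str, str]]:
    """Return (virtual_field, text) fragments for one sample."""
    fragments: list[tuple[str, str]] = []
    # Scalar text fields map directly to sample.* virtual fields.
    for key in ("label", "description"):
        text = sample.get(key)
        if text:
            fragments.append((f"sample.{key}", text))
    # place_name is a list; each entry becomes its own fragment.
    for name in sample.get("place_name") or []:
        if name:
            fragments.append(("sample.place_name", name))
    # Facet URIs are dereferenced to their English pref_label.
    for key in FACET_KEYS:
        for uri in sample.get(key) or []:
            label = vocab_labels.get(uri)
            if label:  # unresolved URIs contribute no text
                fragments.append(("concept.label", label))
    return fragments
```

Each fragment is then tokenized separately, so the field tag on a token row records the entity origin rather than the parquet column it happened to come from.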

v1.5 expansion (post-v1, additive — no schema change):

| virtual field | source (Solr equivalent) |
| --- | --- |
| event.label | producedBy_label (~1.92M) |
| event.description | producedBy_description (~5.54M ≈ 83%) |
| event.has_feature_of_interest | producedBy_hasFeatureOfInterest (~6.35M ≈ 95%) |
| event.sampling_purpose | producedBy_samplingPurpose (~262K) |
| site.label | producedBy_samplingSite_label (~190K) |
| site.description | producedBy_samplingSite_description (~172K) |
| site.place_name | producedBy_samplingSite_placeName[] (~336K rows) |

v2 / Solr searchText parity (named, not built):

| virtual field | source |
| --- | --- |
| agent.name | registrant + responsibility agents |
| curation.label | curation_label |
| curation.description | curation_description |
| curation.location | curation_location |
| keywords | (if present) |
| source | source (already a facet; low-value as FTS) |

2. Tokenizer (build-time)

  • Lowercase ASCII via String.prototype.toLowerCase() / Python str.lower().
  • Unicode NFKC normalization.
  • Diacritic stripping via NFD + combining-mark removal.
  • Whitespace split, punctuation stripped, length filter (1 ≤ len ≤ 64).
  • No stemming. Honest limitation; document in UI copy.
  • Index every token, including stopwords. Stopword handling is query-time, not build-time (see §3) — keeps substrate flexible for future phrase queries.
  • Parallel implementations: JS for browser query, Python for offline build. Shared regression test set (≥30 strings).
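
A possible shape for the Python side of the tokenizer, as a sketch only. Two details are assumptions not fixed by the list above: the step order (NFKC, then lowercase, then diacritic stripping) and the reading of "punctuation stripped" as deleting Unicode punctuation characters from each whitespace-delimited token.

```python
import unicodedata

MIN_LEN, MAX_LEN = 1, 64


def _strip_diacritics(text: str) -> str:
    # Decompose, drop combining marks, then recompose what remains.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def _strip_punctuation(token: str) -> str:
    # Assumption: "punctuation" means Unicode general categories P*.
    return "".join(
        ch for ch in token if not unicodedata.category(ch).startswith("P")
    )


def tokenize(text: str) -> list[str]:
    """Normalize and split text; stopwords are deliberately kept."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = _strip_diacritics(text)
    tokens = (_strip_punctuation(raw) for raw in text.split())
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]
```

Under these assumptions, tokenize("Pottery, from Cyprus!") yields pottery, from, cyprus, and Çatalhöyük becomes catalhoyuk. One thing the shared regression set should probably cover: Python's len counts code points while JavaScript's String.length counts UTF-16 code units, so the length filter can disagree on characters outside the Basic Multilingual Plane.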

3. Query-time policy (distinct from build-time)

A separate axis from the build-time tokenizer.

  • Tokenize the user input with the same tokenizer used at build (lowercase + NFKC + diacritic strip + whitespace split + length filter). Keeps the round-trip invariant.
  • Drop or downweight English stopwords from the bag-of-words AND. Curated list (a, an, the, of, from, for, to, in, on, at, is, was, with, and, or) — small, conservative, no language detection.
    • Rationale: a query like pottery from Cyprus should not fail because no sample has from in its text. Build-time skipping would lose phrase-query potential; query-time is reversible policy.
  • AND-combine the surviving tokens. Empty surviving set ⇒ empty result with helpful copy.
  • No query-language syntax in v1. No quoted phrases, no field-prefix operators (label:foo), no booleans, no negation. Documented v2 path: phrase quoting first; field-prefix and negation later. (Reference: query-spec.qmd Solr surface — explicitly not implemented in v1.)
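
The query-time policy above could look something like this in outline. It is a sketch under assumptions: stopwords are dropped rather than downweighted, the input is already tokenized with the build-time tokenizer, and postings is a hypothetical in-memory map from token to the set of matching pids rather than the real shard lookup.

```python
# Curated stopword list from the policy above.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "from", "for", "to", "in", "on", "at",
     "is", "was", "with", "and", "or"}
)


def surviving_tokens(tokens: list[str]) -> list[str]:
    """Drop stopwords and duplicates, preserving first-seen order."""
    seen: set[str] = set()
    out: list[str] = []
    for tok in tokens:
        if tok in STOPWORDS or tok in seen:
            continue
        seen.add(tok)
        out.append(tok)
    return out


def and_match(tokens: list[str], postings: dict[str, set[str]]) -> set[str]:
    """AND-combine surviving tokens; an empty surviving set gives no results."""
    terms = surviving_tokens(tokens)
    if not terms:
        return set()
    result = set(postings.get(terms[0], set()))
    for tok in terms[1:]:
        result &= postings.get(tok, set())
        if not result:
            break  # no pid can match once the intersection is empty
    return result
```

With postings for pottery and cyprus only, the query tokens pottery, from, cyprus still match, because from never reaches the intersection.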

4. Substrate row schema

{
  token:      VARCHAR  -- normalized token
  pid:        VARCHAR  -- sample primary id
  field:      VARCHAR  -- 'sample.label' | 'sample.description' | 'sample.place_name'
                         | 'concept.label' | (future: 'event.*' | 'site.*' | 'agent.*' …)
  tf:         USMALLINT -- term frequency in this (pid, field) pair
  doc_len:    USMALLINT -- token count of (pid, field) for BM25 length norm
}

Field weights are query-side code, not substrate data. Adding a v1.5 / v2 field = re-running the build pipeline with more sources, no schema migration.
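
To make the row schema concrete, here is a small sketch of how one (pid, field) fragment might be turned into token rows. It assumes tokens have already been produced by the build-time tokenizer, and the clamping to the USMALLINT range is an assumption about overflow handling, not something this contract specifies.

```python
from collections import Counter

USMALLINT_MAX = 65535  # upper bound of an unsigned 16-bit column


def rows_for_fragment(pid: str, field: str, tokens: list[str]) -> list[dict]:
    """Emit one row per distinct token for a single (pid, field) pair."""
    counts = Counter(tokens)
    # doc_len is the token count of the whole (pid, field) fragment.
    doc_len = min(len(tokens), USMALLINT_MAX)
    return [
        {
            "token": token,
            "pid": pid,
            "field": field,
            "tf": min(tf, USMALLINT_MAX),
            "doc_len": doc_len,
        }
        for token, tf in sorted(counts.items())
    ]
```

For example, the tokens red, pottery, red under sample.label produce two rows, pottery with tf 1 and red with tf 2, both carrying doc_len 3.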

5. Ranking spec

  • BM25, fixed k1=1.2, b=0.75 (tune in v1.1 only if benchmark drift demands it).
  • DF (per-token document frequency) precomputed at build time, stored alongside the substrate.
  • Length norm uses doc_len from the schema above.
  • Field weights (query-side, v1):

| field | weight |
| --- | --- |
| sample.label | 3.0 |
| concept.label | 2.5 |
| sample.place_name | 2.0 |
| sample.description | 1.0 |

Final result rank = sum across (pid, field) BM25 contributions weighted by field weight.
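
One way the ranking could be computed from substrate rows, as a hedged sketch. The spec fixes k1, b, and the field weights; the IDF variant used here (the non-negative ln(1 + (N - df + 0.5) / (df + 0.5)) form) and the use of a per-field average document length are assumptions, as are the shapes of the df and avg_len inputs.

```python
import math

K1, B = 1.2, 0.75
FIELD_WEIGHTS = {
    "sample.label": 3.0,
    "concept.label": 2.5,
    "sample.place_name": 2.0,
    "sample.description": 1.0,
}


def bm25_term(tf: int, doc_len: int, avg_len: float, df: int, n_docs: int) -> float:
    """BM25 contribution of one token in one (pid, field) pair."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + K1 * (1.0 - B + B * doc_len / avg_len)
    return idf * tf * (K1 + 1.0) / norm


def score_rows(rows: list[dict], df: dict[str, int], n_docs: int,
               avg_len: dict[str, float]) -> dict[str, float]:
    """Sum field-weighted BM25 contributions per pid."""
    scores: dict[str, float] = {}
    for row in rows:
        weight = FIELD_WEIGHTS.get(row["field"], 0.0)
        contribution = bm25_term(
            row["tf"], row["doc_len"], avg_len[row["field"]],
            df[row["token"]], n_docs,
        )
        scores[row["pid"]] = scores.get(row["pid"], 0.0) + weight * contribution
    return scores
```

Because the weights live in FIELD_WEIGHTS rather than in the rows, they can be retuned without rebuilding the substrate, which is the point of keeping them query-side.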

6. Partition shape

  • Hash-partition by token: hash(token) % N shards.
  • Per-shard byte cap: ≤ 5 MB uncompressed parquet.
  • High-frequency token rule: if a single token's postings would exceed the cap, sub-shard by hash(pid) % M within that token's logical shard.
  • Number of top-level shards (N): start with 64, refine in build measurement.

7. Budgets

| metric | target | rationale |
| --- | --- | --- |
| cold first search (P50) | ≤ 2 s | matches user expectation for "search" |
| warm repeat search | ≤ 500 ms | substantial improvement over ILIKE |
| filter-composed cold search | ≤ 3 s | accommodates source + facet AND |
| bytes transferred cold | ≤ 5 MB | acceptable on residential broadband |
| bytes transferred warm | ≤ 1 MB per query | repeated queries don't refetch shards |

These budgets are the contract: Track 5's GO/NO-GO gate is checked mechanically against this table. "Warm" disambiguation (per #174 deferred): the contract distinguishes two cases:

  • re-run-same-query warm: same query, second invocation, same page (measures end-to-end cache + render path)
  • new-query-after-warm-up warm: different query, after parquet metadata is cached (measures query execution after substrate file is warm)

Both are reported by the benchmark; the budget targets above apply to both.
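
A mechanical gate against the budget table might be as simple as the sketch below. The metric keys, the case names, and the reading of MB as 10^6 bytes are all assumptions for illustration; Track 5 owns the real benchmark format.

```python
# Budget limits from the table above (seconds and bytes).
BUDGETS = {
    "cold_first_search_p50_s": 2.0,
    "warm_search_s": 0.5,
    "filter_composed_cold_s": 3.0,
    "cold_bytes": 5_000_000,
    "warm_bytes_per_query": 1_000_000,
}


def gate(measured: dict[str, dict[str, float]]) -> list[str]:
    """Return the list of failed checks; an empty list means GO.

    measured maps metric -> {case name -> observed value}, so warm
    metrics can carry one entry per warm variant.
    """
    failures: list[str] = []
    for metric, limit in BUDGETS.items():
        cases = measured.get(metric)
        if not cases:
            failures.append(f"{metric}: not measured")
            continue
        for case, value in cases.items():
            if value > limit:
                failures.append(f"{metric}[{case}]: {value} > {limit}")
    return failures
```

Reporting each warm variant as its own case keeps the same limit applied to both, so a regression in only one of them still fails the gate.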

8. Versioning

  • URL pattern: https://data.isamples.org/isamples_YYYYMM_search_index_v1/<shard>.parquet.
  • Explorer pins to a specific YYYYMM so a dataset rebuild can't break a deployed site mid-flight.
  • Index version tied to data version. v1.x format bumps require URL path bump (_v1_v2).
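
For illustration, a pinned shard URL could be built as below. The URL pattern is taken from the list above; the shard naming (here an opaque string such as shard_07) is hypothetical, since this contract does not define it.

```python
import re

BASE_URL = "https://data.isamples.org"


def shard_url(yyyymm: str, shard: str) -> str:
    """Build the URL of one shard for a pinned dataset month."""
    # Validate YYYYMM so a typo cannot silently point at a missing build.
    if not re.fullmatch(r"\d{4}(0[1-9]|1[0-2])", yyyymm):
        raise ValueError(f"expected YYYYMM, got {yyyymm!r}")
    return f"{BASE_URL}/isamples_{yyyymm}_search_index_v1/{shard}.parquet"
```

Keeping yyyymm as a single constant in the Explorer build makes the pin explicit: a new dataset month only takes effect when that constant is changed.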

9. Curated benchmark + quality gate

  • File: tests/search_benchmark.json
  • 12-15 queries, hand-labeled top-10 by Raymond. Must include:
    • bare-text queries (pottery, basalt)
    • multi-term (pottery Cyprus)
    • stopword-heavy (pottery from Cyprus) — verifies query-time stopword policy works
    • concept-only queries (ceramic, bone, mammal) — verifies dereferenced concept labels work; fails loudly if v1 ships without concept labels
    • diacritic (Çatalhöyük)
    • no-hit (xyzzyqqqplugh)
    • filter-composed cases (source-only, source + material)
  • Quality gate is a hard requirement, not advisory. Each release of the substrate must hit:

10. Build-stats artifact (contract requirement)

The v1 substrate build pipeline (#170) MUST emit build_stats.json alongside the partitioned token-row parquets, recording per-virtual-field populated-sample-count, total-token-count, average doc length, concept-label URI resolution rate, top-DF tokens, and shard size distribution. Schema and acceptance thresholds are specified in #170 §6.

This contract item exists so SEARCH_INDEX_V1.md and the builder cannot drift: every release of the substrate carries empirical coverage data, not a doc claim about coverage.

11. Out of scope (v1)

  • Solr-parity field set (named in §1 as the v2 expansion path; not implemented).
  • Stemming (English-specific, hurts non-English content; v2+ if at all).
  • Query-language syntax: quoted phrases, field operators, booleans, negation, wildcards, fuzzy matching, ranges, boosts — all v2+.

Acceptance

  • SEARCH_INDEX_V1.md lands in repo root (or docs/ — match the EXPLORER_STATE.md placement)
  • All 11 sections above populated
  • tests/search_benchmark.json lands with hand-labeled top-10 for the canonical query set, including the concept-only and stopword-heavy queries
  • Build-stats artifact requirement (§10) referenced in the acceptance criteria of #170 (Explorer FTS Track 3: Offline index builder + tokenizer regression set)
  • Doc-only PR; no pipeline or browser code

Refs

#165, #164, PR #95, #170 (build-stats), #174 (warm disambiguation)

Labels: enhancement (New feature or request), explorer (Interactive Explorer features)
