Skip to content

Explore DuckDB FTS extension for full-text search in Explorer #84

@rdhyee

Description

@rdhyee

Problem

The Interactive Explorer currently uses ILIKE '%term%' for text search (line 398). This is a brute-force substring scan — no ranking, no stemming, no stopword handling.

Proposal

Use DuckDB's full-text search extension, which is supported in DuckDB-WASM. This would give us BM25-ranked results, Porter stemming, and stopword filtering.

Approach: Pre-built index

Building the FTS index at runtime in the browser would be too slow for 6.7M records. Instead:

  1. Build offline: Run PRAGMA create_fts_index('samples', 'pid', 'label', 'description', 'place_name') in native DuckDB
  2. Export as .duckdb file: The FTS index is stored as plain tables + macros in a named schema
  3. Host on data.isamples.org alongside the parquet files
  4. ATTACH in browser: ATTACH 'https://data.isamples.org/isamples_fts_index.duckdb' — BM25 scoring comes along for free

Query change

-- Before (substring scan):
WHERE label ILIKE '%pottery%' OR description ILIKE '%pottery%'

-- After (ranked FTS):
SELECT *, score
FROM (
  SELECT *, fts_main_samples.match_bm25(pid, 'pottery') AS score
  FROM samples
)
WHERE score IS NOT NULL
ORDER BY score DESC

Open questions

  • How large will the .duckdb index file be for 6.7M records with 3-4 text columns?
  • Does ATTACH over HTTP work reliably in DuckDB-WASM for files this size?
  • Should we index all text columns or just label + description?
  • Alternative: DuckDB text analytics functions (stemming, stopwords) without a full FTS index — lighter weight but still better than raw ILIKE

Related

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestexplorerInteractive Explorer features

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions