Skip to content

How the Interactive Explorer queries work (tutorial for contributors) #82

@rdhyee

Description

@rdhyee

How the Interactive Explorer Queries Work

A walkthrough of the data architecture and SQL queries powering the Interactive Explorer, for contributors who want to understand or extend it.

Source code: tutorials/isamples_explorer.qmd


Architecture: Two-Tier Loading

The Explorer uses DuckDB-WASM — a full SQL engine running in the browser. No server is involved. It queries parquet files on data.isamples.org using HTTP range requests (only downloads the bytes it needs).

Tier 1 — Instant facet counts (2 KB):

data.isamples.org/isamples_202601_facet_summaries.parquet

Pre-computed counts for each filter category (source, material, sampled feature, specimen type). Loads in milliseconds so the sidebar filters appear immediately.

Tier 2 — Full sample data (278 MB, but only partial bytes transferred):

data.isamples.org/isamples_202601_wide.parquet

The wide-format PQG file with 20M rows. DuckDB-WASM uses range requests to read only the columns and rows matching the current filters — typically a few MB of actual transfer even though the file is 278 MB.


The Key Queries

1. Facet summaries (Tier 1)

Source: lines 346–357

SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_facet_summaries.parquet')

Returns rows like:

facet_type facet_value count
source SESAR 3,428,639
source OPENCONTEXT 1,096,127
material Rock 1,234,567
context Earth interior 890,123
object_type Physical specimen 2,345,678

These populate the checkbox counts in the sidebar before the big parquet file loads.

2. Create a SQL view over the remote parquet

Source: lines 311–316

CREATE VIEW samples AS
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')

This doesn't download the file — it registers it as a virtual table. DuckDB-WASM only fetches data when you actually query it.

3. Dynamic WHERE clause from filters

Source: lines 388–434

The WHERE clause is built dynamically from the current UI state:

-- Always applied (base conditions):
otype = 'MaterialSampleRecord'
AND latitude IS NOT NULL

-- Text search (when user types in search box):
AND (label ILIKE '%pottery%'
     OR description ILIKE '%pottery%'
     OR CAST(place_name AS VARCHAR) ILIKE '%pottery%')

-- Source filter (checkbox selection):
AND n IN ('SESAR', 'OPENCONTEXT')

-- Material filter:
AND has_material_category IN ('Rock', 'Ite Biology')

-- Sampled feature filter:
AND has_context_category IN ('Earth interior')

-- Specimen type filter:
AND has_specimen_category IN ('Physical specimen')

Key column mappings in the wide parquet:

  • n → source name (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
  • otype → entity type (MaterialSampleRecord, SamplingEvent, etc.)
  • has_material_category → material classification
  • has_context_category → sampled feature / context
  • has_specimen_category → specimen type
  • label, description, place_name → text-searchable fields

4. Count matching records

Source: lines 447–455

SELECT COUNT(*) as count
FROM samples
WHERE {whereClause}

5. Load sample data for display

Source: lines 460–500

SELECT
  row_id, pid, label,
  COALESCE(description, '') as description,
  latitude, longitude,
  n as source, place_name
FROM samples
WHERE {whereClause}
ORDER BY RANDOM()
LIMIT {maxSamples}   -- default 25,000

The ORDER BY RANDOM() ensures a representative geographic spread rather than returning the first N rows (which might cluster in one region). The LIMIT caps browser memory usage.


Data Flow Summary

User adjusts filters
    ↓
Observable reactivity rebuilds WHERE clause
    ↓
DuckDB-WASM sends HTTP range requests to data.isamples.org
    ↓
Only matching columns/row groups are downloaded (~2-10 MB)
    ↓
Results rendered as Globe (Cesium) / Table / List
    ↓
URL params updated for bookmarkability

Data Files Reference

File Size Purpose
facet_summaries.parquet 2 KB Pre-computed filter counts
wide.parquet 278 MB Full sample data (queried via range requests)
wide_h3.parquet 292 MB Wide + H3 hex indices (used by globe tutorials)
samples_map_lite.parquet 60 MB Lightweight subset for fast map rendering

All files served from Cloudflare R2 via data.isamples.org with range request + CORS support.


How to Extend

Add a new filter:

  1. Add a new facet type to the facet_summaries parquet (generated by PQG pipeline)
  2. Add a checkbox widget in the Filters section
  3. Add a condition to the whereClause builder

Add a new view mode:

  1. Add option to the viewMode radio selector
  2. Add rendering branch in the results display block

Change data source: Update parquet_url and facet_summaries_url at the top of the file.

Metadata

Metadata

Assignees

No one assigned

    Labels

    documentationImprovements or additions to documentationexplorerInteractive Explorer features

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions