How the Interactive Explorer queries work (tutorial for contributors)

# How the Interactive Explorer Queries Work

A walkthrough of the data architecture and SQL queries powering the [Interactive Explorer](https://isamples.org/tutorials/isamples_explorer.html), for contributors who want to understand or extend it.

**Source code**: [`tutorials/isamples_explorer.qmd`](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd)

---

## Architecture: Two-Tier Loading

The Explorer uses **DuckDB-WASM** — a full SQL engine running in the browser. No server is involved. It queries parquet files on `data.isamples.org` using HTTP range requests (only downloads the bytes it needs).

**Tier 1 — Instant facet counts** (2 KB):
```
data.isamples.org/isamples_202601_facet_summaries.parquet
```
Pre-computed counts for each filter category (source, material, sampled feature, specimen type). Loads in milliseconds so the sidebar filters appear immediately.

**Tier 2 — Full sample data** (278 MB, but only partial bytes transferred):
```
data.isamples.org/isamples_202601_wide.parquet
```
The wide-format PQG file with 20M rows. DuckDB-WASM uses range requests to read only the columns and rows matching the current filters — typically a few MB of actual transfer even though the file is 278 MB.

---

## The Key Queries

### 1. Facet summaries (Tier 1)

[Source: lines 346–357](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd#L346-L357)

```sql
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_facet_summaries.parquet')
```

Returns rows like:

| facet_type | facet_value | count |
|-----------|-------------|-------|
| source | SESAR | 3,428,639 |
| source | OPENCONTEXT | 1,096,127 |
| material | Rock | 1,234,567 |
| context | Earth interior | 890,123 |
| object_type | Physical specimen | 2,345,678 |

These populate the checkbox counts in the sidebar **before** the big parquet file loads.

### 2. Create a SQL view over the remote parquet

[Source: lines 311–316](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd#L311-L316)

```sql
CREATE VIEW samples AS
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')
```

This doesn't download the file — it registers it as a virtual table. DuckDB-WASM only fetches data when you actually query it.

### 3. Dynamic WHERE clause from filters

[Source: lines 388–434](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd#L388-L434)

The WHERE clause is built dynamically from the current UI state:

```sql
-- Always applied (base conditions):
otype = 'MaterialSampleRecord'
AND latitude IS NOT NULL

-- Text search (when user types in search box):
AND (label ILIKE '%pottery%'
     OR description ILIKE '%pottery%'
     OR CAST(place_name AS VARCHAR) ILIKE '%pottery%')

-- Source filter (checkbox selection):
AND n IN ('SESAR', 'OPENCONTEXT')

-- Material filter:
AND has_material_category IN ('Rock', 'Ite Biology')

-- Sampled feature filter:
AND has_context_category IN ('Earth interior')

-- Specimen type filter:
AND has_specimen_category IN ('Physical specimen')
```

Key column mappings in the wide parquet:
- `n` → source name (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
- `otype` → entity type (MaterialSampleRecord, SamplingEvent, etc.)
- `has_material_category` → material classification
- `has_context_category` → sampled feature / context
- `has_specimen_category` → specimen type
- `label`, `description`, `place_name` → text-searchable fields

### 4. Count matching records

[Source: lines 447–455](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd#L447-L455)

```sql
SELECT COUNT(*) as count
FROM samples
WHERE {whereClause}
```

### 5. Load sample data for display

[Source: lines 460–500](https://github.com/isamplesorg/isamplesorg.github.io/blob/main/tutorials/isamples_explorer.qmd#L460-L500)

```sql
SELECT
  row_id, pid, label,
  COALESCE(description, '') as description,
  latitude, longitude,
  n as source, place_name
FROM samples
WHERE {whereClause}
ORDER BY RANDOM()
LIMIT {maxSamples}   -- default 25,000
```

The `ORDER BY RANDOM()` ensures a representative geographic spread rather than returning the first N rows (which might cluster in one region). The `LIMIT` caps browser memory usage.

---

## Data Flow Summary

```
User adjusts filters
    ↓
Observable reactivity rebuilds WHERE clause
    ↓
DuckDB-WASM sends HTTP range requests to data.isamples.org
    ↓
Only matching columns/row groups are downloaded (~2-10 MB)
    ↓
Results rendered as Globe (Cesium) / Table / List
    ↓
URL params updated for bookmarkability
```

---

## Data Files Reference

| File | Size | Purpose |
|------|------|---------|
| [`facet_summaries.parquet`](https://data.isamples.org/isamples_202601_facet_summaries.parquet) | 2 KB | Pre-computed filter counts |
| [`wide.parquet`](https://data.isamples.org/isamples_202601_wide.parquet) | 278 MB | Full sample data (queried via range requests) |
| [`wide_h3.parquet`](https://data.isamples.org/isamples_202601_wide_h3.parquet) | 292 MB | Wide + H3 hex indices (used by globe tutorials) |
| [`samples_map_lite.parquet`](https://data.isamples.org/isamples_202601_samples_map_lite.parquet) | 60 MB | Lightweight subset for fast map rendering |

All files served from Cloudflare R2 via `data.isamples.org` with range request + CORS support.

---

## How to Extend

**Add a new filter**: 
1. Add a new facet type to the `facet_summaries` parquet (generated by PQG pipeline)
2. Add a checkbox widget in the Filters section
3. Add a condition to the `whereClause` builder

**Add a new view mode**: 
1. Add option to the `viewMode` radio selector
2. Add rendering branch in the results display block

**Change data source**: Update `parquet_url` and `facet_summaries_url` at the top of the file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How the Interactive Explorer queries work (tutorial for contributors) #82

How the Interactive Explorer Queries Work

Architecture: Two-Tier Loading

The Key Queries

1. Facet summaries (Tier 1)

2. Create a SQL view over the remote parquet

3. Dynamic WHERE clause from filters

4. Count matching records

5. Load sample data for display

Data Flow Summary

Data Files Reference

How to Extend

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

facet_type	facet_value	count
source	SESAR	3,428,639
source	OPENCONTEXT	1,096,127
material	Rock	1,234,567
context	Earth interior	890,123
object_type	Physical specimen	2,345,678

File	Size	Purpose
`facet_summaries.parquet`	2 KB	Pre-computed filter counts
`wide.parquet`	278 MB	Full sample data (queried via range requests)
`wide_h3.parquet`	292 MB	Wide + H3 hex indices (used by globe tutorials)
`samples_map_lite.parquet`	60 MB	Lightweight subset for fast map rendering

How the Interactive Explorer queries work (tutorial for contributors) #82

Description

How the Interactive Explorer Queries Work

Architecture: Two-Tier Loading

The Key Queries

1. Facet summaries (Tier 1)

2. Create a SQL view over the remote parquet

3. Dynamic WHERE clause from filters

4. Count matching records

5. Load sample data for display

Data Flow Summary

Data Files Reference

How to Extend

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions