explorer: 'Samples in View' counter is the fetch budget, not the real count (#201 Part 1) #206

@rdhyee

Originally filed as Part 1 of #201. Splitting out as a dedicated issue since #201 was closed by #203 / #205 (which fixed Part 2 only).

Symptom

The "Samples in View" stat box reads exactly 5,000 in dense regions, which is the value of DEFAULT_POINT_BUDGET at explorer.qmd:418. In Cyprus (lat ≈ 34.99, lng ≈ 33.70), a direct DuckDB query against data.isamples.org/isamples_202601_samples_map_lite.parquet returns 23,421 samples in a ±0.1° box, so the counter underreports by roughly 4.7x there (23,421 / 5,000). The cluster is one dense site (almost certainly Polis Excavations, OPENCONTEXT source).

Root cause

explorer.qmd:1530-1538 — the point-mode viewport query:

SELECT pid, label, source, latitude, longitude, place_name, result_time
FROM read_parquet('${lite_url}')
WHERE latitude BETWEEN ${padded.south} AND ${padded.north}
  AND longitude BETWEEN ${padded.west} AND ${padded.east}
  ${sourceFilterSQL('source')}
  ${facetFilterSQL()}
LIMIT 5000

explorer.qmd:1557:

updateStats('Samples', cachedData.length, cachedData.length, ..., 'Samples in View', 'Samples in View');

cachedData.length IS the row count of the LIMIT 5000 result. The counter therefore tops out at 5000 by construction.
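One way to make that saturation explicit, as a minimal sketch: treat a result that fills the budget as a lower bound rather than a count. `describeLoadedCount` is a hypothetical helper, not something in explorer.qmd; the budget constant only mirrors the value quoted above.

```javascript
// Hypothetical helper (not in explorer.qmd): classifies a LIMIT-ed result.
const DEFAULT_POINT_BUDGET = 5000; // value quoted in this issue

function describeLoadedCount(rowCount, budget = DEFAULT_POINT_BUDGET) {
  // If the result fills the budget, the true count is unknown: it is at
  // least `budget` and may be much larger. rowCount === budget is
  // ambiguous, so it is treated as capped to stay on the safe side.
  const capped = rowCount >= budget;
  return { loaded: rowCount, capped, isLowerBound: capped };
}

console.log(describeLoadedCount(5000)); // capped result
console.log(describeLoadedCount(1234)); // result below the budget
```

A caller could use the `capped` flag to decide whether the number is safe to present as a count at all.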

Secondary smells:

  • No ORDER BY before LIMIT → which 5000 rows are returned is undefined (probably stable in DuckDB-on-parquet, but not contractual).
  • Label says "in View" but the fetch uses a padded (30%) viewport (explorer.qmd:1514-1522). Even ignoring the cap, the number counts rows in the padded box, not rows in the visible viewport.
  • renderSamplePoints plots all of cachedData including rows outside the actual viewport.
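The padded-versus-visible distinction could be handled client-side by splitting the fetched rows against the unpadded bounds. The sketch below is illustrative only: `inBounds` and `splitByViewport` are hypothetical names, the row fields follow the SELECT list quoted above, and the bounds object shape (`south`, `north`, `west`, `east`) is assumed to match `padded`. It also assumes the viewport does not cross the antimeridian.

```javascript
// Hypothetical helpers; not functions from explorer.qmd.
// Assumes west <= east (no antimeridian crossing).
function inBounds(row, b) {
  return (
    row.latitude >= b.south && row.latitude <= b.north &&
    row.longitude >= b.west && row.longitude <= b.east
  );
}

// Split fetched rows into those inside the unpadded viewport and those
// present only because the fetch box was padded.
function splitByViewport(rows, viewport) {
  const inside = [];
  const paddingOnly = [];
  for (const row of rows) {
    (inBounds(row, viewport) ? inside : paddingOnly).push(row);
  }
  return { inside, paddingOnly };
}

// Made-up example rows near the Cyprus cluster.
const viewport = { south: 34.9, north: 35.1, west: 33.6, east: 33.8 };
const rows = [
  { pid: "a", latitude: 34.99, longitude: 33.7 }, // inside the viewport
  { pid: "b", latitude: 35.12, longitude: 33.7 }, // in the padding only
];
const { inside, paddingOnly } = splitByViewport(rows, viewport);
console.log(inside.length, paddingOnly.length);
```

Counting `inside.length` would make the label literal, though it still cannot exceed the fetch budget.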

Fix directions (from Codex retrospective on #203)

In rough order of effort:

  1. Honest relabel (cheapest): change the label to "Samples Loaded (max N)" and wire the budget value into the label. Counter stops lying.
  2. Compute real count alongside: a fast SELECT count(*) against the same WHERE (no LIMIT) is cheap on the lite parquet via DuckDB-WASM range reads. Display "X loaded / Y total in view", with explicit signaling when Y > X.
  3. Adaptive aggregation: if real count > budget, fall back to a cluster-style representation or surface a "too dense to render individually — Y samples here" affordance.
  4. Add ORDER BY pid to the point query so the 5000 subset is at least deterministic across browsers and sessions.

Direction 2 (real-count alongside) is probably the right user-visible answer; direction 4 is independent and could ship with any of the others.
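Directions 2 and 4 could be combined roughly as follows. This is a sketch under assumptions, not the explorer's actual code: `viewportWhere`, `pointQuerySQL`, `countQuerySQL`, and `formatSamplesStat` are hypothetical names, and `extraFilters` stands in for whatever `sourceFilterSQL('source')` and `facetFilterSQL()` return (assumed here to be either empty or fragments starting with `AND`).

```javascript
// Hypothetical helpers; none of these names exist in explorer.qmd.
const POINT_BUDGET = 5000; // mirrors DEFAULT_POINT_BUDGET as quoted in this issue

// Shared WHERE clause so the point query and the count query cannot drift apart.
function viewportWhere(b, extraFilters = "") {
  return `WHERE latitude BETWEEN ${b.south} AND ${b.north}
  AND longitude BETWEEN ${b.west} AND ${b.east}
  ${extraFilters}`.trimEnd();
}

// Direction 4: ORDER BY pid makes the capped subset deterministic.
function pointQuerySQL(liteUrl, b, extraFilters = "", budget = POINT_BUDGET) {
  return `SELECT pid, label, source, latitude, longitude, place_name, result_time
FROM read_parquet('${liteUrl}')
${viewportWhere(b, extraFilters)}
ORDER BY pid
LIMIT ${budget}`;
}

// Direction 2: same WHERE, no LIMIT, count only.
function countQuerySQL(liteUrl, b, extraFilters = "") {
  return `SELECT count(*) AS n
FROM read_parquet('${liteUrl}')
${viewportWhere(b, extraFilters)}`;
}

// Thousands separators without relying on locale data.
const withCommas = (n) => String(n).replace(/\B(?=(\d{3})+(?!\d))/g, ",");

// "X loaded / Y total in view" when the budget truncated the result.
function formatSamplesStat(loaded, total) {
  return total > loaded
    ? `${withCommas(loaded)} loaded / ${withCommas(total)} total in view`
    : `${withCommas(total)} in view`;
}

console.log(formatSamplesStat(5000, 23421));
```

Building both queries from one WHERE helper keeps the count and the plotted subset consistent when filters change. Whether the count should use the padded or the unpadded bounds is a separate decision that this sketch leaves to the caller.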

Acceptance

  • Counter accurately represents the in-view sample count, or is unambiguously labeled as a capped/loaded count.
  • Cyprus deep-link (#v=1&lat=34.9957&lng=33.6798&alt=15212&mode=point) shows a number that does not silently understate the real density.
  • No regression in cluster-mode "Samples in View" (which already counts viewport intersections correctly).
