feat(mask): add range-aware Runs variant + insert_run / iter_runs APIs #6830
Draft
westonpace wants to merge 3 commits into
Add a criterion benchmark suite targeting RowAddrMask / RowAddrTreeMap
that quantifies the cost of operations whose work is fundamentally
range-shaped but currently goes through per-row Partial(RoaringBitmap)
representation. Six groups:
insert_range_single_run - producer cost: insert one range
into_addr_iter_single_run - consumer cost: walk every row addr
next_range_iter_single_run - achievable cost via Iter::next_range
intersect_two_runs - set op on two range-shaped masks
mask_to_offset_ranges_inner_loop - end-to-end slow path observed in
IS NULL trace (495 ms / 889 ms)
insert_runs_constant_cardinality - many small runs vs one big run
Each varies dataset size while holding number-of-ranges fixed at 1, so
linear scaling in N reveals where row count dominates the cost.
Headline finding (10M-row inputs):
into_addr_iter: 19.4 ms per-bit walk
next_range iter: 1.72 us per-run walk (~11000x faster)
The next_range/iter delta represents the speedup an alternate
range-aware iterator could surface to callers. The roaring crate
already represents the data as run-encoded containers; the
RowAddrMask public API does not expose them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third RowAddrSelection variant, `Runs(Vec<RangeInclusive<u32>>)`,
that stores a per-fragment selection as a sorted, non-overlapping,
non-adjacent list of run-length-encoded ranges. This is the backwards-
compatible step toward a range-aware row-address mask: existing
producers and consumers keep working unchanged, while new range-shaped
callers can sidestep the per-row roaring bitmap that today dominates
mask construction and iteration cost.
New APIs on RowAddrTreeMap:
insert_run(fragment_id, run)
Range-shaped producer counterpart to insert(value) / insert_range.
O(1) amortized when the run extends or is adjacent to the last
entry (the common case for in-order producers like scalar-index
zone searches). Merges into existing Runs preserving invariants.
Falls back to Partial-bitmap inserts when the existing entry is
already Partial (so scalar inserts never silently re-shape data).
iter_runs() -> Iterator<(u32, RangeInclusive<u32>)>
Range-shaped consumer counterpart to into_addr_iter. Yields one
item per contiguous run, not per row. For `Runs` entries the runs
are emitted directly; for `Partial` entries roaring's
Iter::next_range surfaces the bitmap's internal run encoding.
Panics on `Full` (same contract as into_addr_iter).
canonicalize_to_partial(fragment_id)
Force a Runs entry into its equivalent Partial form. Useful for
callers that need raw bitmap access via get_fragment_bitmap.
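The merge rule described for insert_run can be sketched on a bare `Vec<RangeInclusive<u32>>`. This is a minimal illustration, not the code in this PR: the free function, its two fast paths, and the binary-search slow path are assumptions made for the example.

```rust
use std::ops::RangeInclusive;

/// Insert `run` into `runs`, keeping the list sorted, non-overlapping and
/// non-adjacent. Hypothetical stand-alone helper for illustration only.
pub fn insert_run(runs: &mut Vec<RangeInclusive<u32>>, run: RangeInclusive<u32>) {
    if run.is_empty() {
        return;
    }
    let (mut start, mut end) = (*run.start(), *run.end());

    // Fast path 1: the run starts inside or right after the last stored run,
    // so only that last run can be affected. Extend it in place.
    if let Some(last) = runs.last_mut() {
        if start >= *last.start() && u64::from(start) <= u64::from(*last.end()) + 1 {
            let new_end = end.max(*last.end());
            *last = *last.start()..=new_end;
            return;
        }
    }

    // Fast path 2: the list is empty, or the run lies strictly beyond the
    // last stored run with a gap. Append it.
    let appends = match runs.last() {
        None => true,
        Some(last) => u64::from(start) > u64::from(*last.end()) + 1,
    };
    if appends {
        runs.push(start..=end);
        return;
    }

    // Slow path (out-of-order input): find the window of stored runs that
    // overlap or touch the new run, then replace that window with one run.
    // Arithmetic is done in u64 so `end + 1` cannot overflow at u32::MAX.
    let lo = runs.partition_point(|r| u64::from(*r.end()) + 1 < u64::from(start));
    let hi = runs.partition_point(|r| u64::from(*r.start()) <= u64::from(end) + 1);
    if lo < hi {
        start = start.min(*runs[lo].start());
        end = end.max(*runs[hi - 1].end());
    }
    drop(runs.splice(lo..hi, std::iter::once(start..=end)));
}
```

In-order producers only ever hit the two fast paths; the splice in the slow path is what makes the pathological case linear in the number of stored runs.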
Compatibility:
* Every existing match site on RowAddrSelection grew a Runs arm that
either handles the variant natively (len, contains, row_addrs,
iter_runs, into_addr_iter, serialize_into, etc.) or inflates to
Partial via the private into_partial_bitmap helper for ops not
yet range-aware (insert, remove, BitOr/BitAnd/Sub, FromIterator,
Extend). All 97 existing mask tests pass unchanged.
* On-disk format is unchanged: serialize_into inflates Runs to its
equivalent bitmap before writing, so readers built against older
versions continue to load. deserialize_from always yields Partial.
* Hot paths use itertools::Either rather than Box<dyn Iterator> so
the new variant adds no dyn-dispatch cost to the existing Partial
iteration path. Verified by criterion: into_addr_iter at 10M rows
is 19.9 ms before and after.
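The static-dispatch point in the last bullet can be illustrated without external crates. In this sketch the local `Either` enum is a stand-in for `itertools::Either`, and `Selection` with a `Vec<u32>` payload is a hypothetical simplification of the real mask types.

```rust
use std::ops::RangeInclusive;

/// Minimal stand-in for `itertools::Either` so the example needs no crates.
pub enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L, R, T> Iterator for Either<L, R>
where
    L: Iterator<Item = T>,
    R: Iterator<Item = T>,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        // A plain enum match: no vtable lookup, and each arm can be inlined.
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

/// Hypothetical two-variant selection; `Vec<u32>` stands in for the bitmap.
pub enum Selection {
    Partial(Vec<u32>),
    Runs(Vec<RangeInclusive<u32>>),
}

impl Selection {
    /// Yield every selected row offset. Both arms return the same concrete
    /// `Either<_, _>` type, so no `Box<dyn Iterator>` is needed.
    pub fn into_offsets(self) -> impl Iterator<Item = u32> {
        match self {
            Selection::Partial(rows) => Either::Left(rows.into_iter()),
            Selection::Runs(runs) => Either::Right(runs.into_iter().flatten()),
        }
    }
}
```

The trade-off is a branch per `next()` call instead of an indirect call, which the compiler can usually hoist or predict well on a loop that stays in one variant.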
Benchmark deltas (single contiguous run, vs the pre-existing APIs
documented in commit 1b9d7c0):
Producer (insert one run of N rows):
insert_range insert_run speedup
N = 10K 54 ns 31 ns 1.8x
N = 100K 67 ns 31 ns 2.2x
N = 1M 543 ns 31 ns 17.7x
N = 10M 6,499 ns 31 ns 210x
Consumer (iterate selection of N rows):
into_addr_iter iter_runs speedup
N = 10K 19,396 ns 6.3 ns 3,078x
N = 100K 193,111 ns 6.4 ns 30,209x
N = 1M 1,943,641 ns 6.3 ns 306,879x
N = 10M 19,871,915 ns 6.3 ns 3,154,000x
Many runs (1M total cardinality, K runs):
insert_range insert_run speedup
K = 1 608 ns 32 ns 19.2x
K = 10 827 ns 199 ns 4.2x
K = 100 3,123 ns 769 ns 4.1x
K = 1,000 28,416 ns 5,680 ns 5.0x
K = 10,000 272,891 ns 49,155 ns 5.6x
11 new unit tests cover invariant preservation, mixed-variant set ops,
serialization round-trip, and degradation rules (insert into Partial
collapses to Partial, insert into Full is no-op).
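The degradation rules those tests cover could look roughly like the following. This is a sketch under assumptions: `BTreeSet<u32>` stands in for the roaring bitmap, and the `Runs` arm uses a deliberately naive sort-and-merge rather than whatever the PR implements.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Hypothetical simplification of the three-variant selection.
#[derive(Debug, PartialEq)]
pub enum Selection {
    Full,
    Partial(BTreeSet<u32>), // stand-in for a roaring bitmap
    Runs(Vec<RangeInclusive<u32>>),
}

impl Selection {
    pub fn insert_run(&mut self, run: RangeInclusive<u32>) {
        match self {
            // Full already contains every row: nothing to do.
            Selection::Full => {}
            // An existing Partial stays Partial; the run is expanded per row.
            Selection::Partial(rows) => rows.extend(run),
            // Runs stays Runs; restore the invariants after adding the run.
            Selection::Runs(runs) => {
                runs.push(run);
                normalize(runs);
            }
        }
    }
}

/// Naive invariant repair: sort by start, then merge runs that overlap or
/// touch. O(n log n); kept simple on purpose.
fn normalize(runs: &mut Vec<RangeInclusive<u32>>) {
    runs.retain(|r| !r.is_empty());
    runs.sort_by_key(|r| *r.start());
    let mut merged: Vec<RangeInclusive<u32>> = Vec::with_capacity(runs.len());
    for r in runs.drain(..) {
        if let Some(last) = merged.last_mut() {
            if u64::from(*r.start()) <= u64::from(*last.end()) + 1 {
                let end = (*last.end()).max(*r.end());
                *last = *last.start()..=end;
                continue;
            }
        }
        merged.push(r);
    }
    *runs = merged;
}
```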
filtered_read.rs gains a Runs arm in the existing FilteredReadPlan
consumer at line 1606 so callers wiring the new producer through that
path are not blocked.
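For a consumer that wants half-open `Range<u64>` values, the conversion from stored inclusive runs is mechanical. A minimal sketch, assuming the ranges are plain per-fragment row offsets; the function name is hypothetical and this is not the code in filtered_read.rs.

```rust
use std::ops::{Range, RangeInclusive};

/// Convert inclusive u32 runs into half-open u64 ranges, one per run.
/// Widening before adding 1 keeps a run ending at u32::MAX from overflowing.
pub fn runs_to_offset_ranges(runs: &[RangeInclusive<u32>]) -> Vec<Range<u64>> {
    runs.iter()
        .map(|r| u64::from(*r.start())..u64::from(*r.end()) + 1)
        .collect()
}
```

The work here is proportional to the number of runs, not the number of rows they cover.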
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply cargo fmt to the new Runs-variant code and address two clippy
findings:
* manual_let_else in BitAndAssign: convert the
`Some(set) => set, None => continue` match into a `let ... else` (the
retain pass above already guarantees the entry exists; the else arm is
just a defensive skip).
* identity_op in test_iter_runs_mixed_variants: drop the stray `+ 0` in
the second insert_range bound.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
* New `RowAddrSelection` variant, `Runs(Vec<RangeInclusive<u32>>)`, for storing range-shaped per-fragment selections without inflating to a per-row roaring bitmap.
* New APIs on `RowAddrTreeMap`: `insert_run(fragment_id, run)` for range-shaped producers and `iter_runs()` for range-shaped consumers.
* A criterion benchmark suite (`row_addr_mask`) that pins the row-cardinality scaling weakness of the existing API and the cost saved by the new one.

This is an additive, backwards-compatible change: existing code paths keep working unchanged, the on-disk format does not move, and all 97 pre-existing `utils::mask` tests pass alongside 11 new ones.

Motivation
Producers like `lance-index`'s `search_zones` and consumers like `mask_to_offset_ranges` operate naturally on row-address ranges, but the only public `RowAddrSelection` representations today are `Full` and `Partial(RoaringBitmap)`. Every range-shaped result therefore round-trips through a per-row bitmap, so the cost of using a `RowAddrMask` is set by the row cardinality of the result, not the number of distinct ranges.

The baseline benchmark suite (`row_addr_mask`) introduced in the first commit of this PR makes this concrete:

[Table of baseline results for the producer (`insert_range`), consumer (`into_addr_iter`), and end-to-end (`mask_to_offset_ranges_inner_loop`) benchmarks; cell values not preserved.]

The consumer-side gap (≈11,000×) is the largest, and matches what we observed in production: a chrome trace of `IS NULL` against a zonemap-indexed 10M-row dataset spent ≈495 ms of 889 ms inside `mask_to_offset_ranges`, all of it converting between the per-row mask and a `Vec<Range<u64>>`.

What this PR does not do
It does not migrate any callers to the new APIs, and does not change on-disk semantics. The point of this PR is to land the data structure, the API surface, and the benchmarks so follow-up PRs can cut over `search_zones`, `mask_to_offset_ranges`, and friends one at a time with a measurable delta each.

API surface
Both `insert_run` and `iter_runs` are full citizens of the mask machinery:
* `insert_run` preserves the sorted / non-overlapping / non-adjacent invariants even on unsorted input. Merging is O(num_runs) in the pathological case, O(1) amortized in the common in-order case.
* `iter_runs` works on all three variants: it yields stored ranges for `Runs`, surfaces roaring's container run-encoding for `Partial` via `Iter::next_range`, and panics on `Full` (matching the existing `into_addr_iter` contract, with the same `unsafe` justification).

Backwards compatibility
* Existing `utils::mask` tests: all 97 pass unchanged.
* On-disk format: `serialize_into` inflates `Runs` to its equivalent bitmap before writing; old readers continue to load. `deserialize_from` always returns `Partial`.
* `into_addr_iter` cost: hot paths use `itertools::Either` rather than `Box<dyn Iterator>` so the new arm adds no dynamic dispatch to the existing path.
* Set ops (`&`, `|`, `-`, `Extend`): handle `Runs` inputs by transparently inflating to `Partial` before applying the existing roaring-bitmap logic, a semantics-preserving fallback. Native run-shaped set ops are deferred to a follow-up since `intersect_two_runs` at 10M rows is already 12 µs (not a bottleneck).
* `RowAddrSelection` consumers outside `mask.rs`: the one in `filtered_read.rs` (in `FilteredReadExec::with_plan`) grew a `Runs` arm that emits the stored runs as `Range<u64>` directly. Compilation guarantees no other consumer was missed.

Benchmark results
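Run with `cargo bench -p lance-core --bench row_addr_mask`.

Producer: insert one run covering N rows
[Table comparing `insert_range` (existing) with `insert_run` (new); cell values not preserved.]

Consumer: iterate selection of N rows
[Table comparing `into_addr_iter` (existing) with `iter_runs` (new); cell values not preserved.]

Producer: K runs summing to 1M rows
[Table comparing `insert_range` (existing) with `insert_run` (new); cell values not preserved.]

Test plan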
* All 97 existing `utils::mask` tests still pass.
* `insert_run` invariant preservation on in-order, out-of-order, and overlapping inputs.
* `insert_run` degradation rules: Full → no-op, Partial → stays Partial, empty/Runs → stays Runs.
* `iter_runs` against pure-Runs, pure-Partial, and mixed-variant maps.
* `canonicalize_to_partial` converts Runs to Partial in place.
* Serialization: a `Runs`-built map and the equivalent `Partial`-built map produce byte-identical on-disk output.
* Set ops (`&`, `|`, `-`) yield identical cardinalities for `Runs`-built and `Partial`-built equivalent inputs.
* `cargo build --workspace --tests` clean.
* `cargo bench -p lance-core --bench row_addr_mask` runs end-to-end and the criterion `change:` output confirms `into_addr_iter` was not regressed by the new variant.

Follow-ups (not in this PR)
1. Cut over `lance_index::scalar::zoned::search_zones` from `RowAddrTreeMap::insert_range` to `insert_run`. This is the producer half of the zonemap `IS NULL` hot path.
2. Cut over `lance_table::rowids::RowIdSequence::mask_to_offset_ranges`'s `U64Segment::Range` arm to consume `iter_runs` instead of materializing the source range and intersecting. Closes the consumer half.
3. Native run-shaped set ops (`Runs ∩ Runs` → `Runs`) once a call site shows that it wants the result-side representation preserved.
4. An on-disk format extension so `Runs` can be written on the wire too, avoiding the inflate-on-write step. Not needed for any of (1)–(3) since current call sites build masks in-memory per query.
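
The native `Runs ∩ Runs` → `Runs` set op listed above is not part of this PR. As a rough illustration of what it could look like, here is a two-pointer sketch over two run lists that already satisfy the sorted / non-overlapping / non-adjacent invariants; the function name and signature are hypothetical.

```rust
use std::ops::RangeInclusive;

/// Intersect two sorted, non-overlapping run lists in O(a.len() + b.len()).
/// The output keeps the same invariants as the inputs.
pub fn intersect_runs(
    a: &[RangeInclusive<u32>],
    b: &[RangeInclusive<u32>],
) -> Vec<RangeInclusive<u32>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        // Overlap of the two current runs, if any.
        let start = (*a[i].start()).max(*b[j].start());
        let end = (*a[i].end()).min(*b[j].end());
        if start <= end {
            out.push(start..=end);
        }
        // Advance whichever run ends first; the other may still overlap
        // the next run on the opposite side.
        if a[i].end() < b[j].end() {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

The cost depends only on the number of runs on each side, which is the property the range-aware representation is meant to preserve.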
🤖 Generated with Claude Code