feat: bloom filter integration for distributed joins #25
- New `include/common/bloom_filter.hpp` with `BloomFilter` class
- New `src/common/bloom_filter.cpp` with implementation
- Supports configurable expected elements and false positive rate (default 1%)
- MurmurHash3-based double hashing for bit positions
- `serialize()`/`deserialize()` for RPC transfer

- Add `BloomFilterPush = 11` to `RpcType` enum
- Add `BloomFilterArgs` struct with `serialize()`/`deserialize()`
- Extend `ClusterManager` with `bloom_filters_` map
- Add `set_bloom_filter()`, `has_bloom_filter()`, `get_bloom_filter()`, `clear_bloom_filter()`
- Add `get_probe_table()`, `get_probe_key_col()` helpers

- Add `BloomFilterPush` RPC handler in `main.cpp`
- Modify `PushData` handler to apply bloom filter before buffering
- Coordinator sends bloom filter metadata after Phase 1
- Filter application reduces network traffic for low-selectivity joins

- Add 10 unit tests in `tests/bloom_filter_test.cpp`
- Test `BloomFilterArgs` serialization round-trip
- Test `ClusterManager` bloom filter storage operations
- Test bloom filter application logic (PushData simulation)
- Update `PHASE_6_DISTRIBUTED_JOIN.md` with bloom filter docs
- Update `docs/phases/README.md` with bloom filter feature
- Update `SQLITE_COMPARISON.md` with Section 7: Bloom Filter Optimization
- Add `bloom_filter.cpp` and `bloom_filter_tests` to `CMakeLists.txt`
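The interface described in these bullets can be sketched as follows. This is a minimal stand-in, not the PR's code: `std::hash` replaces MurmurHash3 so the sketch is self-contained, plain string keys replace the engine's `Value` type, and the sizing formulas are the standard Bloom-filter math (m = -n·ln p / ln²2, k = (m/n)·ln 2).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the BloomFilter described in the PR. The real class
// hashes with MurmurHash3; std::hash stands in here so the sketch compiles
// standalone.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t expected_elements,
                         double false_positive_rate = 0.01) {
        if (false_positive_rate <= 0.0 || false_positive_rate >= 1.0)
            false_positive_rate = 0.01;                    // clamp invalid FPR
        if (expected_elements == 0) expected_elements = 1; // avoid log(0) math
        num_bits_ = static_cast<std::size_t>(
            std::ceil(-static_cast<double>(expected_elements) *
                      std::log(false_positive_rate) /
                      (std::log(2.0) * std::log(2.0))));
        num_hashes_ = static_cast<std::size_t>(std::ceil(
            (static_cast<double>(num_bits_) / expected_elements) * std::log(2.0)));
        bits_.assign((num_bits_ + 7) / 8, 0);
    }

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            set_bit(bit_position(key, i));
    }

    bool might_contain(const std::string& key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            if (!get_bit(bit_position(key, i))) return false; // definite miss
        return true; // possible hit (may be a false positive)
    }

private:
    // Double hashing: position_i = (h1 + i*h2) mod m, with a key-dependent h2.
    std::size_t bit_position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1; // non-zero stride
        return (h1 + i * h2) % num_bits_;
    }
    void set_bit(std::size_t pos) { bits_[pos / 8] |= (1u << (pos % 8)); }
    bool get_bit(std::size_t pos) const { return bits_[pos / 8] & (1u << (pos % 8)); }

    std::size_t num_bits_ = 0;
    std::size_t num_hashes_ = 0;
    std::vector<std::uint8_t> bits_;
};
```

Note the deliberate asymmetry in the return values: `might_contain` can only be trusted when it says `false`, which is what makes dropping rows on a miss safe in the shuffle path.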
📝 Walkthrough

Adds Bloom filter support for distributed shuffle-join: new BloomFilter implementation, BloomFilterPush RPC and args, ClusterManager storage APIs, sender-side tuple filtering before PushData, documentation updates, and unit tests plus a CMake test target.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Coord as Coordinator
    participant DN as DataNode(s)
    participant CM as ClusterManager
    participant Catalog as Catalog
    Coord->>DN: Phase 1: shuffle left (build)
    DN->>CM: buffer left rows
    Coord->>Coord: build BloomFilter from buffered rows
    Coord->>Coord: serialize BloomFilter -> BloomFilterArgs
    Coord->>DN: BloomFilterPush(BloomFilterArgs)
    DN->>CM: set_bloom_filter(context_id, build/probe, filter_data)
    Coord->>DN: Phase 2: shuffle right (probe)
    DN->>CM: get_bloom_filter(context_id)
    DN->>Catalog: lookup probe key column index
    loop per row
        DN->>DN: bloom.might_contain(row[key_idx])?
        alt match
            DN->>DN: include row in PushData
        else not match
            DN->>DN: drop row (skip send)
        end
    end
    DN->>Coord: PushData(filtered rows)
```
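The per-row loop in the diagram is effectively a pre-send partition of the probe rows. A hypothetical sketch of that step (`Row`, `StubFilter`, `filter_rows_for_push`, and `key_idx` are illustrative names, not the project's API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row type standing in for the engine's tuple representation.
using Row = std::vector<std::string>;

// Keep only rows whose join key *might* be in the build side. Dropping a row
// here is safe because Bloom filters never produce false negatives; keeping
// a false positive only costs one extra tuple on the wire.
template <typename Filter>
std::vector<Row> filter_rows_for_push(const std::vector<Row>& rows,
                                      std::size_t key_idx,
                                      const Filter& bloom) {
    std::vector<Row> kept;
    for (const auto& row : rows) {
        if (key_idx < row.size() && bloom.might_contain(row[key_idx]))
            kept.push_back(row);
    }
    return kept;
}

// Stub filter for demonstration: "contains" keys ending in an even digit.
struct StubFilter {
    bool might_contain(const std::string& key) const {
        return !key.empty() && (key.back() - '0') % 2 == 0;
    }
};
```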
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/distributed/distributed_executor.cpp (1)
217-241: ⚠️ Potential issue | 🟡 Minor — Preserve the failing node and remote error for Phase 1.

This now collapses the first `!reply.success` into "Shuffle failed on node during Phase 1", which drops both `node.id` and `reply.error_msg`. Those strings are surfaced back to clients later, so this makes shard failures much harder to diagnose.

Suggested fix
```diff
-    bool phase1_success = true;
     for (const auto& node : data_nodes) {
         network::RpcClient client(node.address, node.cluster_port);
         if (!client.connect()) {
             QueryResult res;
             res.set_error("Failed to connect to node " + node.id + " for shuffle");
@@
         }
         auto reply = network::QueryResultsReply::deserialize(resp);
         if (!reply.success) {
-            phase1_success = false;
-            break;
+            QueryResult res;
+            res.set_error("Shuffle failed on node " + node.id + ": " +
+                          reply.error_msg);
+            return res;
         }
     }
-
-    if (!phase1_success) {
-        QueryResult res;
-        res.set_error("Shuffle failed on node during Phase 1");
-        return res;
-    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 217 - 241, Loop in Phase 1 currently discards the failing node ID and remote error when reply.success is false; modify the branch that sets phase1_success = false (after network::QueryResultsReply::deserialize) to construct and return a QueryResult that includes node.id and reply.error_msg (e.g., "Shuffle failed on node <node.id>: <reply.error_msg>"), or at minimum append the remote error when present; update the QueryResult res.set_error call in that failure path instead of collapsing to the generic "Shuffle failed on node during Phase 1" so callers receive the failing node and remote error details.
🧹 Nitpick comments (1)
tests/bloom_filter_test.cpp (1)
147-165: Avoid deterministic negative assertions against a probabilistic filter.
`EXPECT_FALSE` for non-inserted values and `EXPECT_EQ(filtered.size(), 2)` both assume zero false positives, which Bloom filters do not guarantee. These tests will get flaky if the hash distribution or filter parameters change; assert only the no-false-negative property, or stub the predicate in the PushData-logic test.

Also applies to: 235-271
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bloom_filter_test.cpp` around lines 147 - 165, The test assumes zero false positives which is invalid for BloomFilter; in TEST(BloomFilterTests, DifferentValueTypes) and the other PushData-logic test (referenced by the comment), remove deterministic negative assertions (remove EXPECT_FALSE calls that check non-inserted values against BloomFilter::might_contain) and remove exact-size assertions like EXPECT_EQ(filtered.size(), 2); instead only assert the no-false-negative property by verifying inserted values return true via BloomFilter::might_contain (keep the EXPECT_TRUE checks for Value::make_int64(1), Value::make_int64(2), Value::make_float64(3.14), Value::make_text("string"), Value::make_bool(true)) or, if you must check negatives, assert probabilistically (e.g. allow up to an expected false-positive rate) rather than expecting strictly false.
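One way to keep a negative check without flakiness is to bound the observed false-positive rate rather than expect zero. A sketch with a toy filter (`TinyBloom` and `observed_fp_rate` are illustrative; the real tests use GoogleTest and the project's `BloomFilter`):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Toy filter with a deliberately coarse 64-bit table so that false positives
// can actually occur; stands in for the project's BloomFilter.
struct TinyBloom {
    std::uint64_t bits = 0;
    void insert(int v) { bits |= 1ull << (std::hash<int>{}(v) % 64); }
    bool might_contain(int v) const {
        return bits & (1ull << (std::hash<int>{}(v) % 64));
    }
};

// Returns the fraction of *non-inserted* probes that report a (false) hit.
double observed_fp_rate(const TinyBloom& bf, int probe_start, int probes) {
    int hits = 0;
    for (int v = probe_start; v < probe_start + probes; ++v)
        if (bf.might_contain(v)) ++hits;
    return static_cast<double>(hits) / probes;
}
```

The inserted-value checks stay deterministic (no false negatives by construction), while the negative side only asserts that the false-positive fraction stays under a generous bound.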
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Around line 42-46: The "Post-Optimization Enhancements" block is a verbatim
duplicate of Section 5 (including the headings and bullets referencing Buffer
Pool Bypass (fetch_page_by_id), HeapTable::Iterator, and Batch Insert Mode),
which shifted the bloom-filter material to Section 7; remove the duplicate block
so only the original Section 5 remains, renumber the subsequent sections
accordingly, and ensure the bloom-filter content (currently in Section 7)
becomes Section 6; verify headings "Post-Optimization Enhancements", "Buffer
Pool Bypass (fetch_page_by_id)", "HeapTable::Iterator", and "Batch Insert Mode"
only appear once.
- Around line 53-57: The docs incorrectly describe build/probe sides for the
bloom-filter flow; update the wording to match the implementation in
src/distributed/distributed_executor.cpp where the left table is shuffled/built
first and the bloom is made available before the right/probe shuffle (see the
BloomFilterPush RPC used to broadcast filter metadata). Change the bullets to
state the bloom is built from the left/build table, applied to filter the
right/probe table, that each data node constructs its local bloom during the
left/build scan, and that the coordinator’s BloomFilterPush occurs prior to the
right/probe shuffle.
In `@include/network/rpc_message.hpp`:
- Around line 446-504: BloomFilterArgs has a type-width bug: expected_elements
and num_hashes are size_t but serialize/deserialize treat them as 32-bit and
8-bit values, leading to uninitialized upper bytes on 64-bit systems; fix by
adding default initializers in the struct (e.g., expected_elements = 0;
num_hashes = 0) and use fixed-width temporaries during (de)serialization: in
serialize() write a uint32_t tmp_expected =
static_cast<uint32_t>(expected_elements) and uint8_t tmp_hashes =
static_cast<uint8_t>(num_hashes) into the buffer, and in deserialize() read a
uint32_t tmp_expected and a uint8_t tmp_hashes from the buffer then assign
args.expected_elements = static_cast<size_t>(tmp_expected) and args.num_hashes =
tmp_hashes; update code in BloomFilterArgs::serialize() and
BloomFilterArgs::deserialize() accordingly so no partial memcpy into size_t
occurs.
In `@src/common/bloom_filter.cpp`:
- Around line 12-36: Validate inputs at the start of BloomFilter::BloomFilter:
if expected_elements_ == 0, treat this as an empty filter by setting num_bits_ =
0, num_hashes_ = 0 and leaving bits_ empty (return/skip the m/k computation) so
no huge allocation occurs; if false_positive_rate is <= 0 or >= 1, clamp or
replace it with a safe default (e.g. 0.01) before computing m and k to avoid
NaN/inf, then proceed to compute num_bits_, num_hashes_, apply the existing
min/max clamps, and resize bits_ only when num_bits_ > 0.
- Around line 85-91: BloomFilter::get_bit_position currently uses a constant h2
derived from the literal "salt", making the double-hash stride independent of
the key; change h2 to be key-dependent (derive it from the input hash or the
original key bytes) so each key has a distinct secondary hash. Specifically, in
get_bit_position compute h2 by hashing the provided hash (or the original key
bytes when available) with a different seed (e.g., seed2) via murmur3_hash,
ensure h2 is non-zero (e.g., add 1 if needed) and then use (h1 + i * h2) %
num_bits_ so probes cover the full bit space for different keys; update
references to h1, h2 and num_bits_ in that function accordingly.
- Around line 39-58: The BloomFilter constructor (BloomFilter::BloomFilter)
currently accepts truncated payloads by setting num_bits_, num_hashes_, and
expected_elements_ even when the serialized bit array is too short, which can
cause out-of-bounds access in might_contain(); change the constructor to
validate that the provided size contains the full header plus bit_bytes and
reject/truncate the construction if not: after computing size_t bit_bytes =
(num_bits_ + 7) / 8, if size < offset + bit_bytes then reset members (e.g., set
num_bits_, num_hashes_, expected_elements_ to 0 and clear bits_) or return/throw
an error so a half-initialized filter cannot be used by methods like
might_contain(). Ensure you update any callers/tests that expect behavior on
invalid input accordingly.
- Around line 45-51: The code currently memcpy's size_t directly for num_bits_,
num_hashes_, and expected_elements_, which is ABI-dependent; instead
serialize/deserialize using a fixed-width integer (e.g., std::uint64_t) and then
cast to size_t. Concretely, when writing: copy static_cast<uint64_t>(num_bits_)
/ num_hashes_ / expected_elements_ into the buffer via memcpy (or use a htonll
helper) and when reading: memcpy into a uint64_t temp, apply the matching
endianness conversion (ntohll or your RPC's byte-order helpers), then assign via
static_cast<size_t>(temp). Apply the same change for the other block mentioned
(lines 127-133) so both serialization and deserialization use uint64_t with
explicit byte-order conversion rather than raw size_t.
In `@src/main.cpp`:
- Around line 475-518: The bloom filtering is applied on the receiver side after
PushData has already been serialized and sent, so move the bloom-check logic
into the sender path where PushDataArgs / ShuffleFragment is constructed; locate
the code that builds/sends PushData/PushDataArgs in the ShuffleFragment path and
perform the same steps currently in main.cpp (use
cluster_manager->has_bloom_filter, cluster_manager->get_bloom_filter,
cluster_manager->get_probe_key_col and catalog->get_table_by_name to find
key_idx and call bloom.might_contain on each row) before serializing/sending,
and only include the filtered rows in the outgoing PushDataArgs; leave the
receiver-side fallback buffering (cluster_manager->buffer_shuffle_data) intact
but remove the redundant bloom check there.
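The key-dependent double hashing suggested for `get_bit_position` can be sketched like this (`std::hash` stands in for the project's `murmur3_hash`, and the constant is an arbitrary mixing seed, not from the source):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Sketch of the fix: derive h2 from the key itself (here by rehashing h1
// under a different seed) instead of hashing a constant "salt", so each key
// gets its own probe stride.
std::size_t bit_position(const std::string& key, std::size_t i,
                         std::size_t num_bits) {
    std::size_t h1 = std::hash<std::string>{}(key);
    std::size_t h2 = std::hash<std::size_t>{}(h1 ^ 0x9e3779b97f4a7c15ull);
    if (h2 == 0) h2 = 1;               // a zero stride would probe one bit only
    return (h1 + i * h2) % num_bits;   // distinct stride per key
}
```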
---
Outside diff comments:
In `@src/distributed/distributed_executor.cpp`:
- Around line 217-241: Loop in Phase 1 currently discards the failing node ID
and remote error when reply.success is false; modify the branch that sets
phase1_success = false (after network::QueryResultsReply::deserialize) to
construct and return a QueryResult that includes node.id and reply.error_msg
(e.g., "Shuffle failed on node <node.id>: <reply.error_msg>"), or at minimum
append the remote error when present; update the QueryResult res.set_error call
in that failure path instead of collapsing to the generic "Shuffle failed on
node during Phase 1" so callers receive the failing node and remote error
details.
---
Nitpick comments:
In `@tests/bloom_filter_test.cpp`:
- Around line 147-165: The test assumes zero false positives which is invalid
for BloomFilter; in TEST(BloomFilterTests, DifferentValueTypes) and the other
PushData-logic test (referenced by the comment), remove deterministic negative
assertions (remove EXPECT_FALSE calls that check non-inserted values against
BloomFilter::might_contain) and remove exact-size assertions like
EXPECT_EQ(filtered.size(), 2); instead only assert the no-false-negative
property by verifying inserted values return true via BloomFilter::might_contain
(keep the EXPECT_TRUE checks for Value::make_int64(1), Value::make_int64(2),
Value::make_float64(3.14), Value::make_text("string"), Value::make_bool(true))
or, if you must check negatives, assert probabilistically (e.g. allow up to an
expected false-positive rate) rather than expecting strictly false.
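The fixed-width serialization recommended above amounts to writing every `size_t` as an explicit little-endian `uint64_t`. A sketch of hypothetical helpers (the project's actual code may use `bswap64` or its RPC byte-order helpers instead):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Write a 64-bit value byte-by-byte, low byte first, so 32- and 64-bit nodes
// produce identical wire bytes regardless of host endianness or sizeof(size_t).
void write_u64(std::vector<std::uint8_t>& buf, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        buf.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Read the matching little-endian representation back into a uint64_t;
// callers then static_cast to size_t.
std::uint64_t read_u64(const std::uint8_t* p) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}
```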
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5aee8ae5-53ea-4ac1-95d9-e434279eb4fe
📒 Files selected for processing (11)
- CMakeLists.txt
- docs/performance/SQLITE_COMPARISON.md
- docs/phases/PHASE_6_DISTRIBUTED_JOIN.md
- docs/phases/README.md
- include/common/bloom_filter.hpp
- include/common/cluster_manager.hpp
- include/network/rpc_message.hpp
- src/common/bloom_filter.cpp
- src/distributed/distributed_executor.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
- Use uint64_t with bswap64 for portable serialization
- Add input validation for zero expected_elements and invalid FPR
- Fix truncated payload handling in BloomFilter constructor
- Make h2 in get_bit_position key-dependent via rehashing
- Add num_bits_ == 0 guards in insert() and might_contain()

- Fix Phase 1 error to include node.id and reply.error_msg
- Move bloom filtering to sender side in ShuffleFragment handler
- Remove receiver-side bloom check in PushData handler
- Fix tests to only assert no-false-negative property
- Remove duplicate section in SQLITE_COMPARISON.md
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/main.cpp (2)
592-629: Snapshot bloom-filter metadata once per shuffle.

This branch re-fetches/deserializes the same filter and re-walks the schema for every destination node, even though it only depends on `args.context_id` and `args.table_name`. Hoist the bloom lookup, stored probe-table check, and key-index resolution above the node loop so you only compute it once and don't read cluster state piecemeal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 592 - 629, Hoist the bloom-filter lookup and probe-key/table resolution out of the per-destination loop so the filter and key index are computed only once per shuffle: call cluster_manager->has_bloom_filter(args.context_id), cluster_manager->get_bloom_filter(args.context_id), and cluster_manager->get_probe_key_col(args.context_id) once before iterating destinations, then resolve the probe column via catalog->get_table_by_name(args.table_name) and compute key_idx once (using the same column-name scan currently in the inner branch). If a valid bloom and key_idx exist, keep a single bloom instance and key_idx visible to the node loop and apply the filter to rows_to_send inside that loop (or precompute a filtered rows_to_send once if filtering is independent of destination), removing repeated get_bloom_filter/get_probe_key_col/get_table_by_name calls and the per-node schema walk.
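The hoisting suggested here boils down to snapshotting the loop-invariant state once before iterating destinations. A hypothetical sketch (`BloomSnapshot` and `snapshot_bloom` are illustrative names, not the project's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Loop-invariant state for one shuffle: computed once, reused per destination.
struct BloomSnapshot {
    bool active = false;                   // is there a usable filter at all?
    std::size_t key_idx = 0;               // resolved probe-key column index
    std::vector<std::uint8_t> filter_data; // serialized filter payload
};

// Called once, above the per-destination loop, with values the caller looked
// up from its cluster manager / catalog equivalents.
BloomSnapshot snapshot_bloom(bool has_filter,
                             std::vector<std::uint8_t> filter_data,
                             std::size_t key_idx) {
    BloomSnapshot s;
    if (has_filter && !filter_data.empty()) {
        s.active = true;
        s.key_idx = key_idx;
        s.filter_data = std::move(filter_data); // take ownership, no extra copy
    }
    return s;
}
```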
499-503: Move the serialized bitmap into `set_bloom_filter()`.

`args.filter_data` is not used after this call, so this path currently copies the full filter payload once more on every `BloomFilterPush`.

♻️ Small cleanup
```diff
     cluster_manager->set_bloom_filter(
         args.context_id, args.build_table, args.probe_table,
-        args.probe_key_col, args.filter_data, args.expected_elements,
+        args.probe_key_col, std::move(args.filter_data), args.expected_elements,
         args.num_hashes);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 499 - 503, The BloomFilterPush path is copying the serialized bitmap unnecessarily; update the callsite to transfer ownership of args.filter_data into set_bloom_filter (use std::move(args.filter_data)) and modify the set_bloom_filter signature/implementation to accept the payload by value or rvalue reference so it can take ownership without an extra copy; ensure cluster_manager->set_bloom_filter(...) (and any other callers) are updated to the new parameter type and that args.filter_data is not accessed after the move.

tests/bloom_filter_test.cpp (1)
167-188: Round-trip an actual Bloom-filter payload here.

The `{0x01, 0x02, 0x03}` stub only proves that a byte vector survives `BloomFilterArgs`. It doesn't exercise the real header+bitmap payload that `BloomFilter(const uint8_t*, size_t)` parses, so a transport bug can still slip through.

🧪 Suggested test shape
```diff
 TEST(BloomFilterTests, BloomFilterArgsSerialization) {
+    BloomFilter bf(100);
+    bf.insert(Value::make_int64(42));
+
     BloomFilterArgs args;
     args.context_id = "ctx_123";
     args.build_table = "users";
     args.probe_table = "orders";
     args.probe_key_col = "user_id";
-    args.filter_data = {0x01, 0x02, 0x03};
-    args.expected_elements = 1000;
-    args.num_hashes = 4;
+    args.filter_data = bf.serialize();
+    args.expected_elements = bf.expected_elements();
+    args.num_hashes = bf.num_hashes();
@@
     EXPECT_EQ(args.expected_elements, deserialized.expected_elements);
     EXPECT_EQ(args.num_hashes, deserialized.num_hashes);
     ASSERT_EQ(args.filter_data.size(), deserialized.filter_data.size());
     EXPECT_EQ(args.filter_data, deserialized.filter_data);
+
+    BloomFilter round_trip(deserialized.filter_data.data(), deserialized.filter_data.size());
+    EXPECT_TRUE(round_trip.might_contain(Value::make_int64(42)));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bloom_filter_test.cpp` around lines 167 - 188, The test currently uses a trivial byte vector {0x01,0x02,0x03} which doesn't exercise the real BloomFilter serialization format; instead, construct a real bloom payload (use the BloomFilter builder or create a BloomFilter instance via BloomFilter(const uint8_t*, size_t) / factory that produces header+bitmap), copy its underlying buffer into BloomFilterArgs::filter_data, then serialize/deserialize and additionally construct a BloomFilter from deserialized.filter_data (call BloomFilter(deserialized.filter_data.data(), deserialized.filter_data.size())) to ensure the payload parses correctly and preserves header+bitmap semantics; keep the existing equality checks and add an assertion that the parsed BloomFilter reports the same properties (e.g., expected_elements/num_hashes or bit checks).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Line 80: Update the Markdown heading "## 8. Future Roadmap" to "## 7. Future
Roadmap" so the section numbering follows the preceding "## 6" heading; locate
the heading text "Future Roadmap" and change the leading number from 8 to 7.
- Around line 67-74: The table incorrectly points to the old PushData sender;
update the row so the sender-side bloom check is attributed to the
RpcType::ShuffleFragment sender path instead of PushData and adjust the Location
column to reference the sender-side shuffle RPC handler (where
RpcType::ShuffleFragment is implemented) so the doc describes the current
pre-send bloom filtering flow.
In `@src/common/bloom_filter.cpp`:
- Around line 68-99: The header fields num_bits_, num_hashes_, and
expected_elements_ are read from untrusted data and must be validated before
using them to compute bit_bytes, resizing bits_, or running loops; clamp them to
sane maximums and reject impossible combinations. After deserializing
tmp_num_bits/tmp_num_hashes/tmp_expected (used to set num_bits_, num_hashes_,
expected_elements_), check that num_bits_ > 0 and that bit_bytes = (num_bits_ +
7)/8 does not overflow and is <= remaining payload (size - offset); also enforce
an upper limit on num_hashes_ (e.g., a reasonable constant to avoid pathological
loops) and on expected_elements_; if any check fails, reset
num_bits_/num_hashes_/expected_elements_, clear bits_, and return; only then
resize bits_ and memcpy the payload into bits_. Ensure all references are to
num_bits_, num_hashes_, expected_elements_, bits_, and the temporary tmp_*
values you already use.
---
Nitpick comments:
In `@src/main.cpp`:
- Around line 592-629: Hoist the bloom-filter lookup and probe-key/table
resolution out of the per-destination loop so the filter and key index are
computed only once per shuffle: call
cluster_manager->has_bloom_filter(args.context_id),
cluster_manager->get_bloom_filter(args.context_id), and
cluster_manager->get_probe_key_col(args.context_id) once before iterating
destinations, then resolve the probe column via
catalog->get_table_by_name(args.table_name) and compute key_idx once (using the
same column-name scan currently in the inner branch). If a valid bloom and
key_idx exist, keep a single bloom instance and key_idx visible to the node loop
and apply the filter to rows_to_send inside that loop (or precompute a filtered
rows_to_send once if filtering is independent of destination), removing repeated
get_bloom_filter/get_probe_key_col/get_table_by_name calls and the per-node
schema walk.
- Around line 499-503: The BloomFilterPush path is copying the serialized bitmap
unnecessarily; update the callsite to transfer ownership of args.filter_data
into set_bloom_filter (use std::move(args.filter_data)) and modify the
set_bloom_filter signature/implementation to accept the payload by value or
rvalue reference so it can take ownership without an extra copy; ensure
cluster_manager->set_bloom_filter(...) (and any other callers) are updated to
the new parameter type and that args.filter_data is not accessed after the move.
In `@tests/bloom_filter_test.cpp`:
- Around line 167-188: The test currently uses a trivial byte vector
{0x01,0x02,0x03} which doesn't exercise the real BloomFilter serialization
format; instead, construct a real bloom payload (use the BloomFilter builder or
create a BloomFilter instance via BloomFilter(const uint8_t*, size_t) / factory
that produces header+bitmap), copy its underlying buffer into
BloomFilterArgs::filter_data, then serialize/deserialize and additionally
construct a BloomFilter from deserialized.filter_data (call
BloomFilter(deserialized.filter_data.data(), deserialized.filter_data.size()))
to ensure the payload parses correctly and preserves header+bitmap semantics;
keep the existing equality checks and add an assertion that the parsed
BloomFilter reports the same properties (e.g., expected_elements/num_hashes or
bit checks).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a526f028-ed56-470a-b7ab-a72ffd759bac
📒 Files selected for processing (6)
- docs/performance/SQLITE_COMPARISON.md
- include/network/rpc_message.hpp
- src/common/bloom_filter.cpp
- src/distributed/distributed_executor.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- include/network/rpc_message.hpp
- src/distributed/distributed_executor.cpp
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Around line 54-65: The fenced diagram block lacks a language tag (MD040);
update the code fence around the ASCII diagram (the triple-backtick block
containing "[Phase 1: Shuffle Left] [Phase 2: Shuffle Right]" and the
diagram lines) to include a language identifier such as text (e.g., ```text) so
the linter recognizes the block correctly.
- Around line 67-76: The markdown table under the "### Key Components" heading
in SQLITE_COMPARISON.md must be surrounded by blank lines to satisfy
markdownlint MD058; edit the block that lists the components (the header "###
Key Components" and the following table rows including `BloomFilter`,
`BloomFilterArgs`, `ClusterManager`, `PushData`, `ShuffleFragment`, and
`Coordinator`) and add one blank line before the table and one blank line after
the table so the table is separated from surrounding text.
In `@tests/bloom_filter_test.cpp`:
- Around line 245-259: Update the test comment references in
BloomFilterTests::BloomFilterApplicationLogic to replace stale mentions of the
PushData handler with the correct sender-side shuffle path terminology (e.g.,
RpcType::ShuffleFragment) so comments match implementation; locate the comment
lines around the simulated tuple filtering and change "PushData" to
"ShuffleFragment" or "sender-side shuffle" and adjust any accompanying wording
to reflect that the bloom check happens on the sender shuffle path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e9744677-040b-43c1-87f4-a3155c63413d
📒 Files selected for processing (4)
- docs/performance/SQLITE_COMPARISON.md
- src/common/bloom_filter.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- src/main.cpp
- src/common/bloom_filter.cpp
```
[Phase 1: Shuffle Left]        [Phase 2: Shuffle Right]
        |                               |
        v                               v
 Build local bloom             Apply bloom filter
 from join keys                before buffering
        |                               |
        +---- BloomFilterPush ----->----+
          (filter metadata)             |
                                        v
                          Filtered tuples buffered
```
Add a language tag to the fenced diagram block.
The code fence is missing a language identifier, which triggers MD040.
Suggested patch
````diff
-```
+```text
 [Phase 1: Shuffle Left]        [Phase 2: Shuffle Right]
         |                               |
         v                               v
  Build local bloom             Apply bloom filter
  from join keys                before buffering
         |                               |
         +---- BloomFilterPush ----->----+
           (filter metadata)             |
                                         v
                           Filtered tuples buffered
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 54-54: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/performance/SQLITE_COMPARISON.md` around lines 54 - 65, The fenced diagram block lacks a language tag (MD040); update the code fence around the ASCII diagram (the triple-backtick block containing "[Phase 1: Shuffle Left] [Phase 2: Shuffle Right]" and the diagram lines) to include a language identifier such as text (e.g., ```text) so the linter recognizes the block correctly.
### Key Components

| Component | Location | Purpose |
|-----------|----------|---------|
| `BloomFilter` class | `include/common/bloom_filter.hpp` | MurmurHash3-based bloom filter |
| `BloomFilterArgs` RPC | `include/network/rpc_message.hpp` | Serialization for network transfer |
| `ClusterManager` storage | `include/common/cluster_manager.hpp` | Stores bloom filter per context |
| `PushData` handler | `src/main.cpp` | Receives and buffers filtered tuples |
| `ShuffleFragment` handler | `src/main.cpp` | Applies bloom filter before sending |
| Coordinator | `src/distributed/distributed_executor.cpp` | Broadcasts filter after Phase 1 |
Surround the table with blank lines to satisfy markdownlint.
Add a blank line before/after the table block (MD058).
Suggested patch
```diff
 ### Key Components
+
 | Component | Location | Purpose |
 |-----------|----------|---------|
 | `BloomFilter` class | `include/common/bloom_filter.hpp` | MurmurHash3-based bloom filter |
 | `BloomFilterArgs` RPC | `include/network/rpc_message.hpp` | Serialization for network transfer |
 | `ClusterManager` storage | `include/common/cluster_manager.hpp` | Stores bloom filter per context |
 | `PushData` handler | `src/main.cpp` | Receives and buffers filtered tuples |
 | `ShuffleFragment` handler | `src/main.cpp` | Applies bloom filter before sending |
 | Coordinator | `src/distributed/distributed_executor.cpp` | Broadcasts filter after Phase 1 |
+
 ### Test Coverage
```

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 68-68: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/performance/SQLITE_COMPARISON.md` around lines 67 - 76, The markdown
table under the "### Key Components" heading in SQLITE_COMPARISON.md must be
surrounded by blank lines to satisfy markdownlint MD058; edit the block that
lists the components (the header "### Key Components" and the following table
rows including `BloomFilter`, `BloomFilterArgs`, `ClusterManager`, `PushData`,
`ShuffleFragment`, and `Coordinator`) and add one blank line before the table
and one blank line after the table so the table is separated from surrounding
text.
```cpp
 * @brief Tests bloom filter application logic (simulates PushData handler behavior).
 */
TEST(BloomFilterTests, BloomFilterApplicationLogic) {
    // Build bloom filter with known keys
    BloomFilter bf(100);
    bf.insert(Value::make_int64(10));
    bf.insert(Value::make_int64(20));
    bf.insert(Value::make_int64(30));

    // Verify no-false-negative: inserted values must be found via might_contain
    EXPECT_TRUE(bf.might_contain(Value::make_int64(10)));
    EXPECT_TRUE(bf.might_contain(Value::make_int64(20)));
    EXPECT_TRUE(bf.might_contain(Value::make_int64(30)));

    // Simulate tuple filtering (as done in PushData handler)
```
Update stale handler naming in test comments.
These comments still reference PushData, but the bloom check is documented/implemented on the sender-side shuffle path (RpcType::ShuffleFragment). Rename the comments to avoid drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/bloom_filter_test.cpp` around lines 245 - 259, Update the test comment
references in BloomFilterTests::BloomFilterApplicationLogic to replace stale
mentions of the PushData handler with the correct sender-side shuffle path
terminology (e.g., RpcType::ShuffleFragment) so comments match implementation;
locate the comment lines around the simulated tuple filtering and change
"PushData" to "ShuffleFragment" or "sender-side shuffle" and adjust any
accompanying wording to reflect that the bloom check happens on the sender
shuffle path.
Summary
Test Plan
- `make bloom_filter_tests && ./bloom_filter_tests` (10 tests pass)
- `make sqlEngineCore`
- `make run-tests`

Changes
- `include/common/bloom_filter.hpp`, `src/common/bloom_filter.cpp`
- `include/network/rpc_message.hpp`
- `include/common/cluster_manager.hpp`
- `src/main.cpp`, `src/distributed/distributed_executor.cpp`
- `tests/bloom_filter_test.cpp`
- `docs/phases/PHASE_6_DISTRIBUTED_JOIN.md`, `docs/performance/SQLITE_COMPARISON.md`

Summary by CodeRabbit
New Features
Tests
Documentation