feat: bloom filter integration for distributed joins #25
- New `include/common/bloom_filter.hpp` with `BloomFilter` class
- New `src/common/bloom_filter.cpp` with implementation
- Supports configurable expected elements and false positive rate (default 1%)
- MurmurHash3-based double hashing for bit positions
- `serialize()`/`deserialize()` for RPC transfer

- Add `BloomFilterPush = 11` to `RpcType` enum
- Add `BloomFilterArgs` struct with `serialize()`/`deserialize()`
- Extend `ClusterManager` with `bloom_filters_` map
- Add `set_bloom_filter()`, `has_bloom_filter()`, `get_bloom_filter()`, `clear_bloom_filter()`
- Add `get_probe_table()`, `get_probe_key_col()` helpers

- Add `BloomFilterPush` RPC handler in `main.cpp`
- Modify `PushData` handler to apply bloom filter before buffering
- Coordinator sends bloom filter metadata after Phase 1
- Filter application reduces network traffic for low-selectivity joins

- Add 10 unit tests in `tests/bloom_filter_test.cpp`
- Test `BloomFilterArgs` serialization round-trip
- Test `ClusterManager` bloom filter storage operations
- Test bloom filter application logic (PushData simulation)
- Update `PHASE_6_DISTRIBUTED_JOIN.md` with bloom filter docs
- Update `docs/phases/README.md` with bloom filter feature
- Update `SQLITE_COMPARISON.md` with Section 7: Bloom Filter Optimization
- Add `bloom_filter.cpp` and `bloom_filter_tests` to `CMakeLists.txt`
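The interface described in these bullets can be sketched as follows. This is a minimal stand-in, not the PR's code: `std::hash` replaces MurmurHash3 so the sketch is self-contained, plain string keys replace the engine's `Value` type, and the sizing formulas are the standard Bloom-filter math (m = -n·ln p / ln²2, k = (m/n)·ln 2).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the BloomFilter described in the PR. The real class
// hashes with MurmurHash3; std::hash stands in here so the sketch compiles
// standalone.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t expected_elements,
                         double false_positive_rate = 0.01) {
        if (false_positive_rate <= 0.0 || false_positive_rate >= 1.0)
            false_positive_rate = 0.01;                    // clamp invalid FPR
        if (expected_elements == 0) expected_elements = 1; // avoid log(0) math
        num_bits_ = static_cast<std::size_t>(
            std::ceil(-static_cast<double>(expected_elements) *
                      std::log(false_positive_rate) /
                      (std::log(2.0) * std::log(2.0))));
        num_hashes_ = static_cast<std::size_t>(std::ceil(
            (static_cast<double>(num_bits_) / expected_elements) * std::log(2.0)));
        bits_.assign((num_bits_ + 7) / 8, 0);
    }

    void insert(const std::string& key) {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            set_bit(bit_position(key, i));
    }

    bool might_contain(const std::string& key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i)
            if (!get_bit(bit_position(key, i))) return false; // definite miss
        return true; // possible hit (may be a false positive)
    }

private:
    // Double hashing: position_i = (h1 + i*h2) mod m, with a key-dependent h2.
    std::size_t bit_position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1; // non-zero stride
        return (h1 + i * h2) % num_bits_;
    }
    void set_bit(std::size_t pos) { bits_[pos / 8] |= (1u << (pos % 8)); }
    bool get_bit(std::size_t pos) const { return bits_[pos / 8] & (1u << (pos % 8)); }

    std::size_t num_bits_ = 0;
    std::size_t num_hashes_ = 0;
    std::vector<std::uint8_t> bits_;
};
```

Note the deliberate asymmetry in the return values: `might_contain` can only be trusted when it says `false`, which is what makes dropping rows on a miss safe in the shuffle path.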
📝 Walkthrough

Adds Bloom filter support for distributed shuffle-join: new BloomFilter implementation, BloomFilterPush RPC and args, ClusterManager storage APIs, sender-side tuple filtering before PushData, documentation updates, and unit tests plus a CMake test target.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Coord as Coordinator
    participant DN as DataNode(s)
    participant CM as ClusterManager
    participant Catalog as Catalog
    Coord->>DN: Phase 1: shuffle left (build)
    DN->>CM: buffer left rows
    Coord->>Coord: build BloomFilter from buffered rows
    Coord->>Coord: serialize BloomFilter -> BloomFilterArgs
    Coord->>DN: BloomFilterPush(BloomFilterArgs)
    DN->>CM: set_bloom_filter(context_id, build/probe, filter_data)
    Coord->>DN: Phase 2: shuffle right (probe)
    DN->>CM: get_bloom_filter(context_id)
    DN->>Catalog: lookup probe key column index
    loop per row
        DN->>DN: bloom.might_contain(row[key_idx])?
        alt match
            DN->>DN: include row in PushData
        else not match
            DN->>DN: drop row (skip send)
        end
    end
    DN->>Coord: PushData(filtered rows)
```
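The per-row loop in the diagram is effectively a pre-send partition of the probe rows. A hypothetical sketch of that step (`Row`, `StubFilter`, `filter_rows_for_push`, and `key_idx` are illustrative names, not the project's API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row type standing in for the engine's tuple representation.
using Row = std::vector<std::string>;

// Keep only rows whose join key *might* be in the build side. Dropping a row
// here is safe because Bloom filters never produce false negatives; keeping
// a false positive only costs one extra tuple on the wire.
template <typename Filter>
std::vector<Row> filter_rows_for_push(const std::vector<Row>& rows,
                                      std::size_t key_idx,
                                      const Filter& bloom) {
    std::vector<Row> kept;
    for (const auto& row : rows) {
        if (key_idx < row.size() && bloom.might_contain(row[key_idx]))
            kept.push_back(row);
    }
    return kept;
}

// Stub filter for demonstration: "contains" keys ending in an even digit.
struct StubFilter {
    bool might_contain(const std::string& key) const {
        return !key.empty() && (key.back() - '0') % 2 == 0;
    }
};
```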
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/distributed/distributed_executor.cpp (1)
217-241: ⚠️ Potential issue | 🟡 Minor — Preserve the failing node and remote error for Phase 1.

This now collapses the first `!reply.success` into "Shuffle failed on node during Phase 1", which drops both `node.id` and `reply.error_msg`. Those strings are surfaced back to clients later, so this makes shard failures much harder to diagnose.

Suggested fix
```diff
-    bool phase1_success = true;
     for (const auto& node : data_nodes) {
         network::RpcClient client(node.address, node.cluster_port);
         if (!client.connect()) {
             QueryResult res;
             res.set_error("Failed to connect to node " + node.id + " for shuffle");
@@
         }
         auto reply = network::QueryResultsReply::deserialize(resp);
         if (!reply.success) {
-            phase1_success = false;
-            break;
+            QueryResult res;
+            res.set_error("Shuffle failed on node " + node.id + ": " +
+                          reply.error_msg);
+            return res;
         }
     }
-
-    if (!phase1_success) {
-        QueryResult res;
-        res.set_error("Shuffle failed on node during Phase 1");
-        return res;
-    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 217 - 241, Loop in Phase 1 currently discards the failing node ID and remote error when reply.success is false; modify the branch that sets phase1_success = false (after network::QueryResultsReply::deserialize) to construct and return a QueryResult that includes node.id and reply.error_msg (e.g., "Shuffle failed on node <node.id>: <reply.error_msg>"), or at minimum append the remote error when present; update the QueryResult res.set_error call in that failure path instead of collapsing to the generic "Shuffle failed on node during Phase 1" so callers receive the failing node and remote error details.
🧹 Nitpick comments (1)
tests/bloom_filter_test.cpp (1)
147-165: Avoid deterministic negative assertions against a probabilistic filter.
`EXPECT_FALSE` for non-inserted values and `EXPECT_EQ(filtered.size(), 2)` both assume zero false positives, which Bloom filters do not guarantee. These tests will get flaky if the hash distribution or filter parameters change; assert only the no-false-negative property, or stub the predicate in the PushData-logic test.

Also applies to: 235-271
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bloom_filter_test.cpp` around lines 147 - 165, The test assumes zero false positives which is invalid for BloomFilter; in TEST(BloomFilterTests, DifferentValueTypes) and the other PushData-logic test (referenced by the comment), remove deterministic negative assertions (remove EXPECT_FALSE calls that check non-inserted values against BloomFilter::might_contain) and remove exact-size assertions like EXPECT_EQ(filtered.size(), 2); instead only assert the no-false-negative property by verifying inserted values return true via BloomFilter::might_contain (keep the EXPECT_TRUE checks for Value::make_int64(1), Value::make_int64(2), Value::make_float64(3.14), Value::make_text("string"), Value::make_bool(true)) or, if you must check negatives, assert probabilistically (e.g. allow up to an expected false-positive rate) rather than expecting strictly false.
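One way to keep a negative check without flakiness is to bound the observed false-positive rate rather than expect zero. A sketch with a toy filter (`TinyBloom` and `observed_fp_rate` are illustrative; the real tests use GoogleTest and the project's `BloomFilter`):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Toy filter with a deliberately coarse 64-bit table so that false positives
// can actually occur; stands in for the project's BloomFilter.
struct TinyBloom {
    std::uint64_t bits = 0;
    void insert(int v) { bits |= 1ull << (std::hash<int>{}(v) % 64); }
    bool might_contain(int v) const {
        return bits & (1ull << (std::hash<int>{}(v) % 64));
    }
};

// Returns the fraction of *non-inserted* probes that report a (false) hit.
double observed_fp_rate(const TinyBloom& bf, int probe_start, int probes) {
    int hits = 0;
    for (int v = probe_start; v < probe_start + probes; ++v)
        if (bf.might_contain(v)) ++hits;
    return static_cast<double>(hits) / probes;
}
```

The inserted-value checks stay deterministic (no false negatives by construction), while the negative side only asserts that the false-positive fraction stays under a generous bound.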
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Around line 42-46: The "Post-Optimization Enhancements" block is a verbatim
duplicate of Section 5 (including the headings and bullets referencing Buffer
Pool Bypass (fetch_page_by_id), HeapTable::Iterator, and Batch Insert Mode),
which shifted the bloom-filter material to Section 7; remove the duplicate block
so only the original Section 5 remains, renumber the subsequent sections
accordingly, and ensure the bloom-filter content (currently in Section 7)
becomes Section 6; verify headings "Post-Optimization Enhancements", "Buffer
Pool Bypass (fetch_page_by_id)", "HeapTable::Iterator", and "Batch Insert Mode"
only appear once.
- Around line 53-57: The docs incorrectly describe build/probe sides for the
bloom-filter flow; update the wording to match the implementation in
src/distributed/distributed_executor.cpp where the left table is shuffled/built
first and the bloom is made available before the right/probe shuffle (see the
BloomFilterPush RPC used to broadcast filter metadata). Change the bullets to
state the bloom is built from the left/build table, applied to filter the
right/probe table, that each data node constructs its local bloom during the
left/build scan, and that the coordinator’s BloomFilterPush occurs prior to the
right/probe shuffle.
In `@include/network/rpc_message.hpp`:
- Around line 446-504: BloomFilterArgs has a type-width bug: expected_elements
and num_hashes are size_t but serialize/deserialize treat them as 32-bit and
8-bit values, leading to uninitialized upper bytes on 64-bit systems; fix by
adding default initializers in the struct (e.g., expected_elements = 0;
num_hashes = 0) and use fixed-width temporaries during (de)serialization: in
serialize() write a uint32_t tmp_expected =
static_cast<uint32_t>(expected_elements) and uint8_t tmp_hashes =
static_cast<uint8_t>(num_hashes) into the buffer, and in deserialize() read a
uint32_t tmp_expected and a uint8_t tmp_hashes from the buffer then assign
args.expected_elements = static_cast<size_t>(tmp_expected) and args.num_hashes =
tmp_hashes; update code in BloomFilterArgs::serialize() and
BloomFilterArgs::deserialize() accordingly so no partial memcpy into size_t
occurs.
In `@src/common/bloom_filter.cpp`:
- Around line 12-36: Validate inputs at the start of BloomFilter::BloomFilter:
if expected_elements_ == 0, treat this as an empty filter by setting num_bits_ =
0, num_hashes_ = 0 and leaving bits_ empty (return/skip the m/k computation) so
no huge allocation occurs; if false_positive_rate is <= 0 or >= 1, clamp or
replace it with a safe default (e.g. 0.01) before computing m and k to avoid
NaN/inf, then proceed to compute num_bits_, num_hashes_, apply the existing
min/max clamps, and resize bits_ only when num_bits_ > 0.
- Around line 85-91: BloomFilter::get_bit_position currently uses a constant h2
derived from the literal "salt", making the double-hash stride independent of
the key; change h2 to be key-dependent (derive it from the input hash or the
original key bytes) so each key has a distinct secondary hash. Specifically, in
get_bit_position compute h2 by hashing the provided hash (or the original key
bytes when available) with a different seed (e.g., seed2) via murmur3_hash,
ensure h2 is non-zero (e.g., add 1 if needed) and then use (h1 + i * h2) %
num_bits_ so probes cover the full bit space for different keys; update
references to h1, h2 and num_bits_ in that function accordingly.
- Around line 39-58: The BloomFilter constructor (BloomFilter::BloomFilter)
currently accepts truncated payloads by setting num_bits_, num_hashes_, and
expected_elements_ even when the serialized bit array is too short, which can
cause out-of-bounds access in might_contain(); change the constructor to
validate that the provided size contains the full header plus bit_bytes and
reject/truncate the construction if not: after computing size_t bit_bytes =
(num_bits_ + 7) / 8, if size < offset + bit_bytes then reset members (e.g., set
num_bits_, num_hashes_, expected_elements_ to 0 and clear bits_) or return/throw
an error so a half-initialized filter cannot be used by methods like
might_contain(). Ensure you update any callers/tests that expect behavior on
invalid input accordingly.
- Around line 45-51: The code currently memcpy's size_t directly for num_bits_,
num_hashes_, and expected_elements_, which is ABI-dependent; instead
serialize/deserialize using a fixed-width integer (e.g., std::uint64_t) and then
cast to size_t. Concretely, when writing: copy static_cast<uint64_t>(num_bits_)
/ num_hashes_ / expected_elements_ into the buffer via memcpy (or use a htonll
helper) and when reading: memcpy into a uint64_t temp, apply the matching
endianness conversion (ntohll or your RPC's byte-order helpers), then assign via
static_cast<size_t>(temp). Apply the same change for the other block mentioned
(lines 127-133) so both serialization and deserialization use uint64_t with
explicit byte-order conversion rather than raw size_t.
In `@src/main.cpp`:
- Around line 475-518: The bloom filtering is applied on the receiver side after
PushData has already been serialized and sent, so move the bloom-check logic
into the sender path where PushDataArgs / ShuffleFragment is constructed; locate
the code that builds/sends PushData/PushDataArgs in the ShuffleFragment path and
perform the same steps currently in main.cpp (use
cluster_manager->has_bloom_filter, cluster_manager->get_bloom_filter,
cluster_manager->get_probe_key_col and catalog->get_table_by_name to find
key_idx and call bloom.might_contain on each row) before serializing/sending,
and only include the filtered rows in the outgoing PushDataArgs; leave the
receiver-side fallback buffering (cluster_manager->buffer_shuffle_data) intact
but remove the redundant bloom check there.
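The key-dependent double hashing suggested for `get_bit_position` can be sketched like this (`std::hash` stands in for the project's `murmur3_hash`, and the constant is an arbitrary mixing seed, not from the source):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Sketch of the fix: derive h2 from the key itself (here by rehashing h1
// under a different seed) instead of hashing a constant "salt", so each key
// gets its own probe stride.
std::size_t bit_position(const std::string& key, std::size_t i,
                         std::size_t num_bits) {
    std::size_t h1 = std::hash<std::string>{}(key);
    std::size_t h2 = std::hash<std::size_t>{}(h1 ^ 0x9e3779b97f4a7c15ull);
    if (h2 == 0) h2 = 1;               // a zero stride would probe one bit only
    return (h1 + i * h2) % num_bits;   // distinct stride per key
}
```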
---
Outside diff comments:
In `@src/distributed/distributed_executor.cpp`:
- Around line 217-241: Loop in Phase 1 currently discards the failing node ID
and remote error when reply.success is false; modify the branch that sets
phase1_success = false (after network::QueryResultsReply::deserialize) to
construct and return a QueryResult that includes node.id and reply.error_msg
(e.g., "Shuffle failed on node <node.id>: <reply.error_msg>"), or at minimum
append the remote error when present; update the QueryResult res.set_error call
in that failure path instead of collapsing to the generic "Shuffle failed on
node during Phase 1" so callers receive the failing node and remote error
details.
---
Nitpick comments:
In `@tests/bloom_filter_test.cpp`:
- Around line 147-165: The test assumes zero false positives which is invalid
for BloomFilter; in TEST(BloomFilterTests, DifferentValueTypes) and the other
PushData-logic test (referenced by the comment), remove deterministic negative
assertions (remove EXPECT_FALSE calls that check non-inserted values against
BloomFilter::might_contain) and remove exact-size assertions like
EXPECT_EQ(filtered.size(), 2); instead only assert the no-false-negative
property by verifying inserted values return true via BloomFilter::might_contain
(keep the EXPECT_TRUE checks for Value::make_int64(1), Value::make_int64(2),
Value::make_float64(3.14), Value::make_text("string"), Value::make_bool(true))
or, if you must check negatives, assert probabilistically (e.g. allow up to an
expected false-positive rate) rather than expecting strictly false.
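The fixed-width serialization recommended above amounts to writing every `size_t` as an explicit little-endian `uint64_t`. A sketch of hypothetical helpers (the project's actual code may use `bswap64` or its RPC byte-order helpers instead):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Write a 64-bit value byte-by-byte, low byte first, so 32- and 64-bit nodes
// produce identical wire bytes regardless of host endianness or sizeof(size_t).
void write_u64(std::vector<std::uint8_t>& buf, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        buf.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Read the matching little-endian representation back into a uint64_t;
// callers then static_cast to size_t.
std::uint64_t read_u64(const std::uint8_t* p) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}
```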
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5aee8ae5-53ea-4ac1-95d9-e434279eb4fe
📒 Files selected for processing (11)
- CMakeLists.txt
- docs/performance/SQLITE_COMPARISON.md
- docs/phases/PHASE_6_DISTRIBUTED_JOIN.md
- docs/phases/README.md
- include/common/bloom_filter.hpp
- include/common/cluster_manager.hpp
- include/network/rpc_message.hpp
- src/common/bloom_filter.cpp
- src/distributed/distributed_executor.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
- Use uint64_t with bswap64 for portable serialization
- Add input validation for zero expected_elements and invalid FPR
- Fix truncated payload handling in BloomFilter constructor
- Make h2 in get_bit_position key-dependent via rehashing
- Add num_bits_ == 0 guards in insert() and might_contain()

- Fix Phase 1 error to include node.id and reply.error_msg
- Move bloom filtering to sender side in ShuffleFragment handler
- Remove receiver-side bloom check in PushData handler
- Fix tests to only assert no-false-negative property
- Remove duplicate section in SQLITE_COMPARISON.md
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/main.cpp (2)
592-629: Snapshot bloom-filter metadata once per shuffle.

This branch re-fetches/deserializes the same filter and re-walks the schema for every destination node, even though it only depends on `args.context_id` and `args.table_name`. Hoist the bloom lookup, stored probe-table check, and key-index resolution above the node loop so you only compute it once and don't read cluster state piecemeal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 592 - 629, Hoist the bloom-filter lookup and probe-key/table resolution out of the per-destination loop so the filter and key index are computed only once per shuffle: call cluster_manager->has_bloom_filter(args.context_id), cluster_manager->get_bloom_filter(args.context_id), and cluster_manager->get_probe_key_col(args.context_id) once before iterating destinations, then resolve the probe column via catalog->get_table_by_name(args.table_name) and compute key_idx once (using the same column-name scan currently in the inner branch). If a valid bloom and key_idx exist, keep a single bloom instance and key_idx visible to the node loop and apply the filter to rows_to_send inside that loop (or precompute a filtered rows_to_send once if filtering is independent of destination), removing repeated get_bloom_filter/get_probe_key_col/get_table_by_name calls and the per-node schema walk.
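The hoisting suggested here boils down to snapshotting the loop-invariant state once before iterating destinations. A hypothetical sketch (`BloomSnapshot` and `snapshot_bloom` are illustrative names, not the project's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Loop-invariant state for one shuffle: computed once, reused per destination.
struct BloomSnapshot {
    bool active = false;                   // is there a usable filter at all?
    std::size_t key_idx = 0;               // resolved probe-key column index
    std::vector<std::uint8_t> filter_data; // serialized filter payload
};

// Called once, above the per-destination loop, with values the caller looked
// up from its cluster manager / catalog equivalents.
BloomSnapshot snapshot_bloom(bool has_filter,
                             std::vector<std::uint8_t> filter_data,
                             std::size_t key_idx) {
    BloomSnapshot s;
    if (has_filter && !filter_data.empty()) {
        s.active = true;
        s.key_idx = key_idx;
        s.filter_data = std::move(filter_data); // take ownership, no extra copy
    }
    return s;
}
```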
499-503: Move the serialized bitmap into `set_bloom_filter()`.

`args.filter_data` is not used after this call, so this path currently copies the full filter payload once more on every `BloomFilterPush`.

♻️ Small cleanup
```diff
     cluster_manager->set_bloom_filter(
         args.context_id, args.build_table, args.probe_table,
-        args.probe_key_col, args.filter_data, args.expected_elements,
+        args.probe_key_col, std::move(args.filter_data), args.expected_elements,
         args.num_hashes);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 499 - 503, The BloomFilterPush path is copying the serialized bitmap unnecessarily; update the callsite to transfer ownership of args.filter_data into set_bloom_filter (use std::move(args.filter_data)) and modify the set_bloom_filter signature/implementation to accept the payload by value or rvalue reference so it can take ownership without an extra copy; ensure cluster_manager->set_bloom_filter(...) (and any other callers) are updated to the new parameter type and that args.filter_data is not accessed after the move.

tests/bloom_filter_test.cpp (1)
167-188: Round-trip an actual Bloom-filter payload here.

The `{0x01, 0x02, 0x03}` stub only proves that a byte vector survives `BloomFilterArgs`. It doesn't exercise the real header+bitmap payload that `BloomFilter(const uint8_t*, size_t)` parses, so a transport bug can still slip through.

🧪 Suggested test shape
```diff
 TEST(BloomFilterTests, BloomFilterArgsSerialization) {
+    BloomFilter bf(100);
+    bf.insert(Value::make_int64(42));
+
     BloomFilterArgs args;
     args.context_id = "ctx_123";
     args.build_table = "users";
     args.probe_table = "orders";
     args.probe_key_col = "user_id";
-    args.filter_data = {0x01, 0x02, 0x03};
-    args.expected_elements = 1000;
-    args.num_hashes = 4;
+    args.filter_data = bf.serialize();
+    args.expected_elements = bf.expected_elements();
+    args.num_hashes = bf.num_hashes();
@@
     EXPECT_EQ(args.expected_elements, deserialized.expected_elements);
     EXPECT_EQ(args.num_hashes, deserialized.num_hashes);
     ASSERT_EQ(args.filter_data.size(), deserialized.filter_data.size());
     EXPECT_EQ(args.filter_data, deserialized.filter_data);
+
+    BloomFilter round_trip(deserialized.filter_data.data(), deserialized.filter_data.size());
+    EXPECT_TRUE(round_trip.might_contain(Value::make_int64(42)));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bloom_filter_test.cpp` around lines 167 - 188, The test currently uses a trivial byte vector {0x01,0x02,0x03} which doesn't exercise the real BloomFilter serialization format; instead, construct a real bloom payload (use the BloomFilter builder or create a BloomFilter instance via BloomFilter(const uint8_t*, size_t) / factory that produces header+bitmap), copy its underlying buffer into BloomFilterArgs::filter_data, then serialize/deserialize and additionally construct a BloomFilter from deserialized.filter_data (call BloomFilter(deserialized.filter_data.data(), deserialized.filter_data.size())) to ensure the payload parses correctly and preserves header+bitmap semantics; keep the existing equality checks and add an assertion that the parsed BloomFilter reports the same properties (e.g., expected_elements/num_hashes or bit checks).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Line 80: Update the Markdown heading "## 8. Future Roadmap" to "## 7. Future
Roadmap" so the section numbering follows the preceding "## 6" heading; locate
the heading text "Future Roadmap" and change the leading number from 8 to 7.
- Around line 67-74: The table incorrectly points to the old PushData sender;
update the row so the sender-side bloom check is attributed to the
RpcType::ShuffleFragment sender path instead of PushData and adjust the Location
column to reference the sender-side shuffle RPC handler (where
RpcType::ShuffleFragment is implemented) so the doc describes the current
pre-send bloom filtering flow.
In `@src/common/bloom_filter.cpp`:
- Around line 68-99: The header fields num_bits_, num_hashes_, and
expected_elements_ are read from untrusted data and must be validated before
using them to compute bit_bytes, resizing bits_, or running loops; clamp them to
sane maximums and reject impossible combinations. After deserializing
tmp_num_bits/tmp_num_hashes/tmp_expected (used to set num_bits_, num_hashes_,
expected_elements_), check that num_bits_ > 0 and that bit_bytes = (num_bits_ +
7)/8 does not overflow and is <= remaining payload (size - offset); also enforce
an upper limit on num_hashes_ (e.g., a reasonable constant to avoid pathological
loops) and on expected_elements_; if any check fails, reset
num_bits_/num_hashes_/expected_elements_, clear bits_, and return; only then
resize bits_ and memcpy the payload into bits_. Ensure all references are to
num_bits_, num_hashes_, expected_elements_, bits_, and the temporary tmp_*
values you already use.
---
Nitpick comments:
In `@src/main.cpp`:
- Around line 592-629: Hoist the bloom-filter lookup and probe-key/table
resolution out of the per-destination loop so the filter and key index are
computed only once per shuffle: call
cluster_manager->has_bloom_filter(args.context_id),
cluster_manager->get_bloom_filter(args.context_id), and
cluster_manager->get_probe_key_col(args.context_id) once before iterating
destinations, then resolve the probe column via
catalog->get_table_by_name(args.table_name) and compute key_idx once (using the
same column-name scan currently in the inner branch). If a valid bloom and
key_idx exist, keep a single bloom instance and key_idx visible to the node loop
and apply the filter to rows_to_send inside that loop (or precompute a filtered
rows_to_send once if filtering is independent of destination), removing repeated
get_bloom_filter/get_probe_key_col/get_table_by_name calls and the per-node
schema walk.
- Around line 499-503: The BloomFilterPush path is copying the serialized bitmap
unnecessarily; update the callsite to transfer ownership of args.filter_data
into set_bloom_filter (use std::move(args.filter_data)) and modify the
set_bloom_filter signature/implementation to accept the payload by value or
rvalue reference so it can take ownership without an extra copy; ensure
cluster_manager->set_bloom_filter(...) (and any other callers) are updated to
the new parameter type and that args.filter_data is not accessed after the move.
In `@tests/bloom_filter_test.cpp`:
- Around line 167-188: The test currently uses a trivial byte vector
{0x01,0x02,0x03} which doesn't exercise the real BloomFilter serialization
format; instead, construct a real bloom payload (use the BloomFilter builder or
create a BloomFilter instance via BloomFilter(const uint8_t*, size_t) / factory
that produces header+bitmap), copy its underlying buffer into
BloomFilterArgs::filter_data, then serialize/deserialize and additionally
construct a BloomFilter from deserialized.filter_data (call
BloomFilter(deserialized.filter_data.data(), deserialized.filter_data.size()))
to ensure the payload parses correctly and preserves header+bitmap semantics;
keep the existing equality checks and add an assertion that the parsed
BloomFilter reports the same properties (e.g., expected_elements/num_hashes or
bit checks).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a526f028-ed56-470a-b7ab-a72ffd759bac
📒 Files selected for processing (6)
- docs/performance/SQLITE_COMPARISON.md
- include/network/rpc_message.hpp
- src/common/bloom_filter.cpp
- src/distributed/distributed_executor.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- include/network/rpc_message.hpp
- src/distributed/distributed_executor.cpp
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/performance/SQLITE_COMPARISON.md`:
- Around line 54-65: The fenced diagram block lacks a language tag (MD040);
update the code fence around the ASCII diagram (the triple-backtick block
containing "[Phase 1: Shuffle Left] [Phase 2: Shuffle Right]" and the
diagram lines) to include a language identifier such as text (e.g., ```text) so
the linter recognizes the block correctly.
- Around line 67-76: The markdown table under the "### Key Components" heading
in SQLITE_COMPARISON.md must be surrounded by blank lines to satisfy
markdownlint MD058; edit the block that lists the components (the header "###
Key Components" and the following table rows including `BloomFilter`,
`BloomFilterArgs`, `ClusterManager`, `PushData`, `ShuffleFragment`, and
`Coordinator`) and add one blank line before the table and one blank line after
the table so the table is separated from surrounding text.
In `@tests/bloom_filter_test.cpp`:
- Around line 245-259: Update the test comment references in
BloomFilterTests::BloomFilterApplicationLogic to replace stale mentions of the
PushData handler with the correct sender-side shuffle path terminology (e.g.,
RpcType::ShuffleFragment) so comments match implementation; locate the comment
lines around the simulated tuple filtering and change "PushData" to
"ShuffleFragment" or "sender-side shuffle" and adjust any accompanying wording
to reflect that the bloom check happens on the sender shuffle path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e9744677-040b-43c1-87f4-a3155c63413d
📒 Files selected for processing (4)
- docs/performance/SQLITE_COMPARISON.md
- src/common/bloom_filter.cpp
- src/main.cpp
- tests/bloom_filter_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- src/main.cpp
- src/common/bloom_filter.cpp
```
[Phase 1: Shuffle Left]        [Phase 2: Shuffle Right]
        |                               |
        v                               v
 Build local bloom             Apply bloom filter
 from join keys                before buffering
        |                               |
        +---- BloomFilterPush ----->----+
          (filter metadata)             |
                                        v
                          Filtered tuples buffered
```
Add a language tag to the fenced diagram block.
The code fence is missing a language identifier, which triggers MD040.
Suggested patch
````diff
-```
+```text
 [Phase 1: Shuffle Left]        [Phase 2: Shuffle Right]
         |                               |
         v                               v
  Build local bloom             Apply bloom filter
  from join keys                before buffering
         |                               |
         +---- BloomFilterPush ----->----+
           (filter metadata)             |
                                         v
                           Filtered tuples buffered
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 54-54: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/performance/SQLITE_COMPARISON.md` around lines 54 - 65, The fenced diagram block lacks a language tag (MD040); update the code fence around the ASCII diagram (the triple-backtick block containing "[Phase 1: Shuffle Left] [Phase 2: Shuffle Right]" and the diagram lines) to include a language identifier such as text (e.g., ```text) so the linter recognizes the block correctly.
### Key Components

| Component | Location | Purpose |
|-----------|----------|---------|
| `BloomFilter` class | `include/common/bloom_filter.hpp` | MurmurHash3-based bloom filter |
| `BloomFilterArgs` RPC | `include/network/rpc_message.hpp` | Serialization for network transfer |
| `ClusterManager` storage | `include/common/cluster_manager.hpp` | Stores bloom filter per context |
| `PushData` handler | `src/main.cpp` | Receives and buffers filtered tuples |
| `ShuffleFragment` handler | `src/main.cpp` | Applies bloom filter before sending |
| Coordinator | `src/distributed/distributed_executor.cpp` | Broadcasts filter after Phase 1 |
Surround the table with blank lines to satisfy markdownlint.
Add a blank line before/after the table block (MD058).
Suggested patch
```diff
 ### Key Components
+
 | Component | Location | Purpose |
 |-----------|----------|---------|
 | `BloomFilter` class | `include/common/bloom_filter.hpp` | MurmurHash3-based bloom filter |
 | `BloomFilterArgs` RPC | `include/network/rpc_message.hpp` | Serialization for network transfer |
 | `ClusterManager` storage | `include/common/cluster_manager.hpp` | Stores bloom filter per context |
 | `PushData` handler | `src/main.cpp` | Receives and buffers filtered tuples |
 | `ShuffleFragment` handler | `src/main.cpp` | Applies bloom filter before sending |
 | Coordinator | `src/distributed/distributed_executor.cpp` | Broadcasts filter after Phase 1 |
+
 ### Test Coverage
```

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 68-68: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/performance/SQLITE_COMPARISON.md` around lines 67 - 76, The markdown
table under the "### Key Components" heading in SQLITE_COMPARISON.md must be
surrounded by blank lines to satisfy markdownlint MD058; edit the block that
lists the components (the header "### Key Components" and the following table
rows including `BloomFilter`, `BloomFilterArgs`, `ClusterManager`, `PushData`,
`ShuffleFragment`, and `Coordinator`) and add one blank line before the table
and one blank line after the table so the table is separated from surrounding
text.
```cpp
 * @brief Tests bloom filter application logic (simulates PushData handler behavior).
 */
TEST(BloomFilterTests, BloomFilterApplicationLogic) {
    // Build bloom filter with known keys
    BloomFilter bf(100);
    bf.insert(Value::make_int64(10));
    bf.insert(Value::make_int64(20));
    bf.insert(Value::make_int64(30));

    // Verify no-false-negative: inserted values must be found via might_contain
    EXPECT_TRUE(bf.might_contain(Value::make_int64(10)));
    EXPECT_TRUE(bf.might_contain(Value::make_int64(20)));
    EXPECT_TRUE(bf.might_contain(Value::make_int64(30)));

    // Simulate tuple filtering (as done in PushData handler)
```
Update stale handler naming in test comments.
These comments still reference PushData, but the bloom check is documented/implemented on the sender-side shuffle path (RpcType::ShuffleFragment). Rename the comments to avoid drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/bloom_filter_test.cpp` around lines 245 - 259, Update the test comment
references in BloomFilterTests::BloomFilterApplicationLogic to replace stale
mentions of the PushData handler with the correct sender-side shuffle path
terminology (e.g., RpcType::ShuffleFragment) so comments match implementation;
locate the comment lines around the simulated tuple filtering and change
"PushData" to "ShuffleFragment" or "sender-side shuffle" and adjust any
accompanying wording to reflect that the bloom check happens on the sender
shuffle path.
Summary
Test Plan
- `make bloom_filter_tests && ./bloom_filter_tests` (10 tests pass)
- `make sqlEngineCore`
- `make run-tests`

Changes
- `include/common/bloom_filter.hpp`, `src/common/bloom_filter.cpp`
- `include/network/rpc_message.hpp`
- `include/common/cluster_manager.hpp`
- `src/main.cpp`, `src/distributed/distributed_executor.cpp`
- `tests/bloom_filter_test.cpp`
- `docs/phases/PHASE_6_DISTRIBUTED_JOIN.md`, `docs/performance/SQLITE_COMPARISON.md`

Summary by CodeRabbit
New Features
Tests
Documentation