Phase 4: Shard-Aware Routing and Hardened 2PC (#4)
📝 Walkthrough
Adds distributed functionality: a Raft consensus node, RPC client/server, cluster manager, shard routing, a distributed executor with 2PC, and related Catalog/Transaction changes, plus tests and CMake test target updates.
Sequence Diagrams
sequenceDiagram
participant Client
participant Coordinator as Coordinator Server
participant DistExec as DistributedExecutor
participant DataNode1 as Data Node 1
participant DataNode2 as Data Node 2
Client->>Coordinator: COMMIT statement
Coordinator->>DistExec: execute(COMMIT)
Note over DistExec: Phase 1 - Prepare
DistExec->>DataNode1: TxnPrepare RPC
DistExec->>DataNode2: TxnPrepare RPC
DataNode1-->>DistExec: PREPARED
DataNode2-->>DistExec: PREPARED
alt All prepared
Note over DistExec: Phase 2 - Commit
DistExec->>DataNode1: TxnCommit RPC
DistExec->>DataNode2: TxnCommit RPC
DataNode1-->>DistExec: COMMITTED
DataNode2-->>DistExec: COMMITTED
DistExec-->>Coordinator: success
else Prepare failed
Note over DistExec: Phase 2 - Abort
DistExec->>DataNode1: TxnAbort RPC
DistExec->>DataNode2: TxnAbort RPC
DataNode1-->>DistExec: ABORTED
DataNode2-->>DistExec: ABORTED
DistExec-->>Coordinator: error
end
Coordinator-->>Client: result
sequenceDiagram
participant Follower as RaftNode (candidate/follower)
participant Peer1 as RaftNode 2
participant Peer2 as RaftNode 3
Note over Follower: Election timeout
Follower->>Follower: increment term, vote self
Follower->>Peer1: RequestVote RPC
Follower->>Peer2: RequestVote RPC
Peer1-->>Follower: vote granted
Peer2-->>Follower: vote granted
Note over Follower: majority -> become Leader
loop Heartbeat interval (Leader)
Follower->>Peer1: AppendEntries (heartbeat)
Follower->>Peer2: AppendEntries (heartbeat)
Peer1-->>Follower: ACK
Peer2-->>Follower: ACK
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/recovery/log_record.hpp (1)
26-37: ⚠️ Potential issue | 🟠 Major — Prevent WAL deserialization corruption from enum reordering.
Adding PREPARE between BEGIN and COMMIT shifts the numeric values of COMMIT (7→8), ABORT (8→9), and NEW_PAGE (9→10). Since serialization uses raw `memcpy` without versioning, old WAL logs with byte value 7 will deserialize as PREPARE instead of COMMIT during recovery, breaking correctness. Assign explicit numeric IDs to all enum members to keep values stable across additions:
Required fix
enum class LogRecordType : uint8_t { INVALID = 0, INSERT = 1, MARK_DELETE = 2, APPLY_DELETE = 3, ROLLBACK_DELETE = 4, UPDATE = 5, BEGIN = 6, COMMIT = 7, ABORT = 8, NEW_PAGE = 9, PREPARE = 10 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/recovery/log_record.hpp` around lines 26 - 37, The LogRecordType enum must use stable explicit numeric values to avoid WAL deserialization corruption when new members are inserted; update the enum LogRecordType by assigning fixed integer values to every member (e.g., INVALID = 0, INSERT = 1, MARK_DELETE = 2, ..., NEW_PAGE = N) so existing on-disk bytes map to the intended meanings, and ensure any code that serializes/deserializes LogRecordType (uses memcpy or reads raw bytes) continues to operate on these explicit values; keep the same symbolic names (INSERT, MARK_DELETE, APPLY_DELETE, ROLLBACK_DELETE, UPDATE, BEGIN, PREPARE, COMMIT, ABORT, NEW_PAGE) but give each an explicit numeric initializer to lock their numeric IDs.
🟠 Major comments (14)
include/common/cluster_manager.hpp-59-64 (1)
59-64: ⚠️ Potential issue | 🟠 Major — Node liveness never expires, so stale nodes stay routable.
`last_heartbeat` is tracked, but Lines 70-92 only check `is_active`, and this flag is only set to `true` in `heartbeat()` (Lines 62-63). Without timeout-based deactivation, failed nodes remain eligible for routing and 2PC fanout indefinitely.
Also applies to: 70-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/common/cluster_manager.hpp` around lines 59 - 64, The code sets nodes_[id].is_active = true in heartbeat() but never expires nodes so stale nodes remain routable; update the liveness logic to use last_heartbeat as the source of truth: introduce a configurable timeout (e.g., heartbeat_timeout) and, wherever node liveness is checked (the code paths referencing is_active between the checks around lines 70-92), replace or augment the is_active check with a time-based check that compares std::chrono::system_clock::now() - nodes_[id].last_heartbeat > heartbeat_timeout and sets nodes_[id].is_active = false (under mutex_), or simply treat node as inactive when the delta exceeds the timeout; ensure modifications reference heartbeat(), nodes_, last_heartbeat, is_active and mutex_ so stale nodes are deactivated promptly.
include/common/cluster_manager.hpp-36-44 (1)
36-44: ⚠️ Potential issue | 🟠 Major — Constructor intent and behavior diverge: self node is never registered.
Line 37 says “Add self to node map”, but `nodes_` is never updated in this branch. As a result, `get_data_nodes()` / `get_coordinators()` won’t include the local node unless another path registers it.
🔧 Proposed fix
explicit ClusterManager(const config::Config* config) : config_(config) { // Add self to node map if in distributed mode if (config_ != nullptr && config_->mode != config::RunMode::Standalone) { self_node_.id = "local_node"; // Will be replaced by unique ID later self_node_.address = "127.0.0.1"; self_node_.cluster_port = config_->cluster_port; self_node_.role = config_->mode; self_node_.last_heartbeat = std::chrono::system_clock::now(); + nodes_.emplace(self_node_.id, self_node_); } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/common/cluster_manager.hpp` around lines 36 - 44, The constructor for ClusterManager sets up self_node_ when not in Standalone mode but never inserts it into the nodes_ map, so local node lookups (e.g., get_data_nodes(), get_coordinators()) miss it; fix by adding self_node_ to nodes_ after it's populated (use nodes_.emplace(self_node_.id, self_node_) or nodes_[self_node_.id] = self_node_) inside the same branch of explicit ClusterManager(const config::Config* config) so the local node is registered immediately (still leave the placeholder id "local_node" to be replaced later).
include/catalog/catalog.hpp-181-195 (1)
181-195: 🛠️ Refactor suggestion | 🟠 Major — Restrict local-only DDL APIs to prevent accidental consensus bypass.
`create_table_local` / `drop_table_local` are public at Lines 184 and 194. That exposes direct local mutation paths that can bypass the Raft apply flow and risk catalog divergence.
🔧 Proposed hardening
class Catalog { public: @@ - oid_t create_table_local(const std::string& table_name, std::vector<ColumnInfo> columns); @@ - bool drop_table_local(oid_t table_id); @@ private: + friend class raft::RaftNode; + oid_t create_table_local(const std::string& table_name, std::vector<ColumnInfo> columns); + bool drop_table_local(oid_t table_id);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/catalog/catalog.hpp` around lines 181 - 195, The public API exposes local-only DDL functions create_table_local and drop_table_local which can be accidentally invoked and bypass Raft; change their visibility to non-public by moving create_table_local and drop_table_local into the class's private (or protected) section in catalog.hpp (or mark them as internal with an access specifier) and add a clear comment above each explaining they are Raft-only internals; ensure any callers outside the class are updated to use the public create_table/drop_table RPC paths or friend accessors so no external code depends on these methods being public.
src/common/config.cpp-56-57 (1)
56-57: ⚠️ Potential issue | 🟠 Major — `cluster_port` and `port` parsing accept out-of-range values via wraparound.
At lines 55 and 57, casting `std::stoi(value)` to `uint16_t` before bounds validation causes invalid inputs (e.g., 70000) to silently wrap to valid-looking values (e.g., 4464). The validation at line 135 checking `cluster_port > MAX_PORT` is ineffective because `cluster_port` is already `uint16_t`, so this condition can never be true. Validate as `int` before casting to `uint16_t`, and reject values outside the valid port range (1–65535).
🔧 Proposed fix
if (key == "port") { - port = static_cast<uint16_t>(std::stoi(value)); + int parsed_port = 0; + try { + parsed_port = std::stoi(value); + } catch (const std::exception&) { + std::cerr << "Invalid port: " << value << "\n"; + return false; + } + if (parsed_port <= 0 || + parsed_port > static_cast<int>(MAX_PORT)) { + std::cerr << "Invalid port: " << value << "\n"; + return false; + } + port = static_cast<uint16_t>(parsed_port); } else if (key == "cluster_port") { - cluster_port = static_cast<uint16_t>(std::stoi(value)); + int parsed_cluster_port = 0; + try { + parsed_cluster_port = std::stoi(value); + } catch (const std::exception&) { + std::cerr << "Invalid cluster_port: " << value << "\n"; + return false; + } + if (parsed_cluster_port <= 0 || + parsed_cluster_port > static_cast<int>(MAX_PORT)) { + std::cerr << "Invalid cluster_port: " << value << "\n"; + return false; + } + cluster_port = static_cast<uint16_t>(parsed_cluster_port); } @@ if (port == 0 || port > MAX_PORT) { - std::cerr << "Invalid port number: " << port << "\n"; - return false; - } - - if (cluster_port == 0 || cluster_port > MAX_PORT) { std::cerr << "Invalid port number: " << port << "\n"; return false; } + + if (cluster_port == 0) { + std::cerr << "Invalid cluster port number: " << cluster_port << "\n"; + return false; + }Also affects: line 55 (
`port` parsing)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/common/config.cpp` around lines 56 - 57, The port parsing for both port and cluster_port currently casts std::stoi(value) directly to uint16_t (symbols: port, cluster_port), allowing out-of-range inputs to wrap; change parsing to read into a signed integer first (e.g., int parsed = std::stoi(value)), validate parsed is within 1..65535 (compare against MIN_PORT/ MAX_PORT or literal bounds) and only then assign static_cast<uint16_t>(parsed); update or remove the existing ineffective check that compares cluster_port (uint16_t) to MAX_PORT so validation occurs on the int before casting.
include/distributed/shard_manager.hpp-27-27 (1)
27-27: ⚠️ Potential issue | 🟠 Major — Fail fast when shard count is zero.
Line 27 returning shard `0` for `num_shards == 0` hides invalid topology and can silently misroute operations.
Proposed fix
- if (num_shards == 0) return 0; + if (num_shards == 0) { + throw std::invalid_argument("num_shards must be > 0"); + }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` at line 27, The function currently returns 0 when num_shards == 0 which hides an invalid topology; instead detect the zero-shard condition (the if (num_shards == 0) branch) and fail fast by raising a clear error (e.g., throw std::invalid_argument or std::runtime_error with a descriptive message) or using an assert, so callers do not silently get shard 0; update the branch that currently returns 0 to throw/include the chosen failure mechanism and ensure callers that call this function/documentation handle the exception.
src/network/rpc_client.cpp-84-84 (1)
84-84: ⚠️ Potential issue | 🟠 Major — Guard against payload length truncation in RPC header.
Line 84 casts `payload.size()` to `uint16_t` without validation. When payloads exceed 65535 bytes, the header stores a truncated value while the full payload is sent on line 93. The receiver (in `call()`) reads this truncated size from the header and only receives that many bytes, leaving excess data in the stream and desynchronizing all subsequent message framing.
Proposed fix
+#include <limits> ... - header.payload_len = static_cast<uint16_t>(payload.size()); + if (payload.size() > std::numeric_limits<uint16_t>::max()) { + return false; + } + header.payload_len = static_cast<uint16_t>(payload.size());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` at line 84, The header.payload_len is being assigned via static_cast<uint16_t>(payload.size()) which truncates lengths > 65535 and desynchronizes framing; update the sending code that sets header.payload_len (the statement assigning header.payload_len from payload.size()) to first validate payload.size() <= std::numeric_limits<uint16_t>::max() and handle violations by returning an error/throwing an exception or rejecting the RPC request with a clear error message; also ensure callers (e.g., the function that sends the header/payload and the receiver call() that reads header.payload_len) are left consistent by not sending the payload when the size check fails.
src/distributed/distributed_executor.cpp-47-49 (1)
47-49: ⚠️ Potential issue | 🟠 Major — Hard-coded global transaction ID will collide across concurrent sessions.
Using `GLOBAL_TXN_ID = 1` causes cross-session transaction aliasing and breaks 2PC isolation/correctness under concurrent clients.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 47 - 49, The hard-coded constexpr GLOBAL_TXN_ID = 1 will collide across concurrent sessions; replace this static constant with a thread-safe, per-session transaction ID generator (e.g., a std::atomic<uint64_t> global counter or a UUID-based generator) and assign a unique txn id when creating the executor/transaction (look for uses in distributed_executor.cpp and the DistributedExecutor/TransactionManager constructors). Ensure the chosen generator is initialized once and accessed atomically (or use a session-specific counter combined with a unique session id) so each DistributedExecutor instance gets a unique txn id instead of reusing GLOBAL_TXN_ID.
src/catalog/catalog.cpp-76-82 (1)
76-82: ⚠️ Potential issue | 🟠 Major — Raft-enabled DDL path currently mutates local state without replication.
When `raft_node_` is set, both methods still call local mutation directly. In multi-node mode this can diverge catalog metadata across nodes.
💡 Safer interim behavior (fail fast until Raft replication is wired)
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) { if (raft_node_ != nullptr) { - /* TODO: Serialize DDL and replicate via Raft */ - /* For now, just call local to keep it working during Step 4 implementation */ - return create_table_local(table_name, std::move(columns)); + throw std::runtime_error("create_table via Raft is not implemented"); } return create_table_local(table_name, std::move(columns)); } @@ bool Catalog::drop_table(oid_t table_id) { if (raft_node_ != nullptr) { - /* TODO: Replicate via Raft */ - return drop_table_local(table_id); + throw std::runtime_error("drop_table via Raft is not implemented"); } return drop_table_local(table_id); }Also applies to: 112-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/catalog/catalog.cpp` around lines 76 - 82, When raft_node_ is non-null the code currently calls create_table_local (mutating local state without Raft replication); change the branch in the create_table path so that when raft_node_ != nullptr it fails fast with a clear error/Status (e.g., return an error result or throw an exception indicating "DDL replication via Raft not implemented") instead of calling create_table_local, and apply the same change pattern to the similar DDL paths (the other branch around lines 112-117) so no local mutation occurs when raft_node_ is set.
src/distributed/distributed_executor.cpp-88-105 (1)
88-105: ⚠️ Potential issue | 🟠 Major — Phase-2 commit failures are ignored, but success is still returned.
`client.call(...)` and the reply status in phase 2 are discarded. A node-level commit failure can still produce an overall success response.
🧭 Suggested result aggregation for phase 2
- std::vector<std::future<void>> phase2_futures; + std::vector<std::future<std::pair<bool, std::string>>> phase2_futures; for (const auto& node : data_nodes) { - phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() { + phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() { network::RpcClient client(node.address, node.cluster_port); if (client.connect()) { std::vector<uint8_t> resp_payload; - static_cast<void>(client.call(phase2_type, payload, resp_payload)); + if (!client.call(phase2_type, payload, resp_payload)) { + return std::make_pair(false, "[" + node.id + "] phase2 RPC failed"); + } + auto reply = network::QueryResultsReply::deserialize(resp_payload); + if (!reply.success) { + return std::make_pair(false, "[" + node.id + "] phase2 failed: " + reply.error_msg); + } + return std::make_pair(true, std::string{}); } + return std::make_pair(false, "[" + node.id + "] phase2 connect failed"); })); } - for (auto& f : phase2_futures) f.get(); + bool phase2_ok = true; + for (auto& f : phase2_futures) { + auto [ok, err] = f.get(); + if (!ok) { phase2_ok = false; errors += err + "; "; } + } - if (all_prepared) { + if (all_prepared && phase2_ok) { return QueryResult(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 88 - 105, The phase-2 async tasks currently discard client.call(...) results and reply status, allowing node-level commit failures to be ignored; modify the lambda launched in phase2_futures to return a result (e.g., bool success and optional error string) instead of void, have it check the return value of network::RpcClient::call(phase2_type, payload, resp_payload) and parse resp_payload/status for errors, and propagate any failures (collect into a vector of failures or combine into errors) when you .get() the futures; finally, if any phase-2 call failed, set QueryResult res with an appropriate "Distributed transaction aborted" message (using the collected error details) rather than returning success, otherwise return success as before (use identifiers: phase2_futures, data_nodes, network::RpcClient, client.call, phase2_type, payload, resp_payload, all_prepared, errors, QueryResult).
include/network/rpc_client.hpp-27-27 (1)
27-27: ⚠️ Potential issue | 🟠 Major — `is_connected()` reads shared state without synchronization.
`fd_` is mutable connection state; reading it lock-free can race with `connect()` / `disconnect()` updates.
🔒 Suggested thread-safe accessor
- bool is_connected() const { return fd_ >= 0; } + bool is_connected() const { + std::scoped_lock<std::mutex> lock(mutex_); + return fd_ >= 0; + }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_client.hpp` at line 27, is_connected() currently reads the mutable connection field fd_ without synchronization and can race with connect()/disconnect() updates; modify is_connected() to acquire the same lock used to protect fd_ (e.g., the connection mutex or std::mutex guarding fd_) and read fd_ under that lock, or make fd_ an atomic<int> and use atomic load semantics if that fits your concurrency model; ensure the accessor uses the chosen synchronization primitive consistently with connect() and disconnect() to avoid data races.
src/distributed/raft_node.cpp-137-140 (1)
137-140: ⚠️ Potential issue | 🟠 Major — `step_down()` called without holding the mutex in `do_candidate()`.
On line 138, `step_down(resp_term)` is called without holding `mutex_`. However, `step_down()` modifies `persistent_state_` and `state_`, which are also accessed under lock elsewhere. This creates a potential data race.
Compare with line 204 in `handle_request_vote()`, where `step_down()` is correctly called with the lock held.
Proposed fix to acquire lock before step_down
if (resp_term > args.term) { + const std::scoped_lock<std::mutex> lock(mutex_); step_down(resp_term); return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 137 - 140, In do_candidate(), calling step_down(resp_term) without holding mutex_ can cause races because step_down() mutates persistent_state_ and state_; change the code in do_candidate() so that you acquire the same lock used elsewhere (mutex_) before invoking step_down(resp_term) — e.g., wrap the step_down call with the scoped lock used in handle_request_vote() (the same mutex_ and locking pattern) so state updates happen under the lock and no other thread can concurrently access persistent_state_ or state_.
include/network/rpc_message.hpp-144-160 (1)
144-160: ⚠️ Potential issue | 🟠 Major — `TxnOperationArgs` also uses host byte order for `txn_id`.
Similar to `QueryResultsReply`, the `txn_id` is serialized in host byte order. For consistency with `RpcHeader` and cross-platform compatibility, use network byte order.
Proposed fix using manual byte order conversion
struct TxnOperationArgs { uint64_t txn_id = 0; [[nodiscard]] std::vector<uint8_t> serialize() const { std::vector<uint8_t> out(8); - std::memcpy(out.data(), &txn_id, 8); + // Manual conversion to network byte order (big-endian) + for (int i = 0; i < 8; ++i) { + out[7 - i] = static_cast<uint8_t>((txn_id >> (i * 8)) & 0xFF); + } return out; } static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) { TxnOperationArgs args; if (in.size() >= 8) { - std::memcpy(&args.txn_id, in.data(), 8); + // Manual conversion from network byte order (big-endian) + args.txn_id = 0; + for (int i = 0; i < 8; ++i) { + args.txn_id = (args.txn_id << 8) | in[i]; + } } return args; } };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 144 - 160, TxnOperationArgs currently serializes txn_id in host byte order; update TxnOperationArgs::serialize and TxnOperationArgs::deserialize to convert txn_id to/from network byte order (big-endian) before copying to/from the byte vector. In serialize(), convert txn_id to a 64-bit network-order value (use a portable ntoh/hton64 helper or explicit byte-shifts) and write those bytes into the output vector; in deserialize(), read the 8 bytes into a 64-bit network-order value and convert it back to host order before assigning args.txn_id. Ensure the helper (e.g., hton64/ntoh64) is used consistently so TxnOperationArgs matches RpcHeader/QueryResultsReply semantics.
include/network/rpc_message.hpp-97-117 (1)
97-117: ⚠️ Potential issue | 🟠 Major — Byte order inconsistency: `QueryResultsReply` uses host byte order.
The `RpcHeader` correctly uses `htonl`/`ntohl` for network byte order, but `QueryResultsReply::serialize()` writes `err_len` and `row_count` directly via `memcpy` in host byte order. This will cause protocol errors when communicating between machines with different endianness.
Proposed fix to use network byte order
[[nodiscard]] std::vector<uint8_t> serialize() const { std::vector<uint8_t> out; out.push_back(success ? 1 : 0); - uint32_t err_len = static_cast<uint32_t>(error_msg.size()); + uint32_t err_len = htonl(static_cast<uint32_t>(error_msg.size())); size_t offset = out.size(); out.resize(offset + 4 + err_len); - std::memcpy(out.data() + offset, &err_len, 4); - std::memcpy(out.data() + offset + 4, error_msg.data(), err_len); + std::memcpy(out.data() + offset, &err_len, 4); // Already in network order + uint32_t raw_err_len = static_cast<uint32_t>(error_msg.size()); + std::memcpy(out.data() + offset + 4, error_msg.data(), raw_err_len); // Simplified row count serialization - uint32_t row_count = static_cast<uint32_t>(rows.size()); + uint32_t row_count = htonl(static_cast<uint32_t>(rows.size())); offset = out.size(); out.resize(offset + 4); std::memcpy(out.data() + offset, &row_count, 4);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 97 - 117, QueryResultsReply::serialize() writes err_len and row_count in host byte order causing endianness bugs; convert those uint32_t values to network byte order with htonl before memcpy (i.e., call htonl(err_len) and htonl(row_count) and memcpy the converted values), and ensure the corresponding reader (QueryResultsReply::deserialize or the code that parses these fields) uses ntohl when loading them back; update the places referencing err_len, row_count, error_msg, rows and follow the RpcHeader pattern of using htonl/ntohl for consistency.
src/distributed/raft_node.cpp-83-96 (1)
83-96: ⚠️ Potential issue | 🟠 Major — Double random timeout generation in `do_follower` creates inconsistent behavior.
Line 84 generates a timeout, then line 88 inside the predicate generates a different random timeout for comparison. This means the condition compares elapsed time against a different value than the wait duration, potentially causing premature or delayed election timeouts.
Proposed fix to use consistent timeout
void RaftNode::do_follower() { const auto timeout = get_random_timeout(); std::unique_lock<std::mutex> lock(mutex_); - if (cv_.wait_for(lock, timeout, [this] { + const auto deadline = std::chrono::system_clock::now() + timeout; + if (cv_.wait_for(lock, timeout, [this, deadline] { return !running_ || - (std::chrono::system_clock::now() - last_heartbeat_ > get_random_timeout()); + (std::chrono::system_clock::now() > deadline); })) { if (!running_) { return; } // Election timeout reached, become candidate state_ = NodeState::Candidate; } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 83 - 96, The follower loop computes a random timeout twice (once in timeout and again inside the wait predicate) causing inconsistent behavior; in RaftNode::do_follower you should compute the random timeout once and reuse it in the cv_.wait_for predicate (e.g., capture the timeout by value or compute a deadline = last_heartbeat_ + timeout and compare now > deadline) so the duration passed to wait_for and the elapsed-time check against last_heartbeat_ use the same value; update the lambda passed to cv_.wait_for to reference that single timeout/deadline and leave remaining logic (running_, state_ transition to Candidate) unchanged.
🟡 Minor comments (8)
include/catalog/catalog.hpp-75-79 (1)
75-79: ⚠️ Potential issue | 🟡 Minor — Default-initialize `ShardInfo` scalar fields.
shard_id) and Line 78 (port) are currently uninitialized on default construction. Initializing them to zero avoids accidental undefined reads in metadata paths.🔧 Proposed fix
struct ShardInfo { - uint32_t shard_id; + uint32_t shard_id = 0; std::string node_address; - uint16_t port; + uint16_t port = 0; };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/catalog/catalog.hpp` around lines 75 - 79, ShardInfo's scalar members shard_id and port are left uninitialized which can cause undefined reads; update the ShardInfo definition so shard_id and port are default-initialized (e.g., add in-class member initializers or a default constructor that sets shard_id = 0 and port = 0) while leaving node_address as-is, ensuring any default-constructed ShardInfo has deterministic zero values for those fields.
tests/raft_tests.cpp-21-26 (1)
21-26: ⚠️ Potential issue | 🟡 Minor — Avoid fixed ports in tests to reduce flakiness.
Lines 21 and 25 hardcode port 6000; this can fail non-deterministically when the port is in use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_tests.cpp` around lines 21 - 26, The test currently hardcodes TEST_PORT=6000 and uses it for config.cluster_port and constructing network::RpcServer, causing flakes when the port is occupied; change the test to allocate an ephemeral port instead (e.g., bind the server socket to port 0 or call a small helper that finds a free port), retrieve the assigned port, set config.cluster_port to that assigned port, and then construct cluster::ClusterManager and network::RpcServer with the dynamically assigned port (update references to TEST_PORT, config.cluster_port, cm and rpc accordingly).
tests/raft_simulation_tests.cpp-24-24 (1)
24-24: ⚠️ Potential issue | 🟡 Minor — Use dynamic test ports to avoid collisions.
Lines 24 and 44 hardcode ports (7000, 7001), which can cause sporadic failures in CI or parallel runs.
Also applies to: 44-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` at line 24, Tests hardcode fixed ports when constructing the RPC servers (network::RpcServer rpc(7000) and the similar instantiation at the other site), which causes port collisions in parallel/CI runs; change the tests to allocate dynamic ports by creating the RpcServer with port 0 (or use a helper that finds a free port) so the OS assigns an ephemeral port, then query the server for its actual bound port and use that value when composing peer addresses for other nodes; update both occurrences that pass literal 7000/7001 and ensure any follow-up setup (peer lists, client connections) uses the retrieved bound port from network::RpcServer instead of the hardcoded constants.
tests/distributed_tests.cpp-43-46 (1)
43-46: ⚠️ Potential issue | 🟡 Minor — Guard parser output before dereference.
Line 45 dereferences `stmt` without asserting parse success first; a parse failure becomes a crash instead of a clear test failure.
Proposed fix
auto stmt = parser.parse_statement(); + ASSERT_NE(stmt, nullptr); auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 43 - 46, The test currently dereferences the result of parser.parse_statement() (stmt) without checking it; change the test to assert the parse succeeded before using *stmt (e.g., use ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr) immediately after auto stmt = parser.parse_statement()) so the test fails cleanly on parse errors rather than crashing, then call exec.execute(*stmt, "CREATE TABLE test (id INT)") as before.
src/main.cpp-304-308 (1)
304-308: ⚠️ Potential issue | 🟡 Minor — 2PC commit/abort returns success even when the transaction is not found.
In both `TxnCommit` (lines 304-308) and `TxnAbort` (lines 332-336), if `get_transaction(args.txn_id)` returns null, `reply.success` is still set to `true`. This silently masks cases where the coordinator believes a commit/abort succeeded, but the data node had no record of the transaction.
For 2PC correctness, consider returning an error or at least logging a warning when the transaction is not found.
Proposed fix to handle missing transactions
try { auto txn = transaction_manager.get_transaction(args.txn_id); if (txn) { transaction_manager.commit(txn); + reply.success = true; + } else { + // Transaction not found - may have already been cleaned up + reply.success = true; // Idempotent: treat as already committed + // Consider logging: std::cerr << "TxnCommit: txn not found\n"; } - reply.success = true; } catch (const std::exception& e) {Also applies to: 332-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 304 - 308, In TxnCommit and TxnAbort, when transaction_manager.get_transaction(args.txn_id) returns null you currently set reply.success = true; change this so a missing txn is treated as an error: check the returned txn pointer, and if null set reply.success = false (or set an error code/message) and log a warning including args.txn_id before returning; only call transaction_manager.commit(txn) or transaction_manager.abort(txn) and set reply.success = true when txn is non-null. Ensure the same handling and logging is applied in both TxnCommit and TxnAbort.
src/distributed/raft_node.cpp-40-55 (1)
40-55: ⚠️ Potential issue | 🟡 Minor
Race condition: handlers registered after thread starts.
The `raft_thread_` is started on line 42, but RPC handlers are registered on lines 45-54. If an RPC message arrives in this window, it won't be handled. Register handlers before starting the thread.
Proposed fix to register handlers first
```diff
 void RaftNode::start() {
-  running_ = true;
-  raft_thread_ = std::thread(&RaftNode::run_loop, this);
-
   // Register handlers
   rpc_server_.set_handler(
       network::RpcType::RequestVote,
       [this](const network::RpcHeader& h, const std::vector<uint8_t>& p, int fd) {
         handle_request_vote(h, p, fd);
       });
   rpc_server_.set_handler(
       network::RpcType::AppendEntries,
       [this](const network::RpcHeader& h, const std::vector<uint8_t>& p, int fd) {
         handle_append_entries(h, p, fd);
       });
+
+  running_ = true;
+  raft_thread_ = std::thread(&RaftNode::run_loop, this);
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 40 - 55, Move RPC handler registration in RaftNode::start so rpc_server_.set_handler calls for network::RpcType::RequestVote (invoking handle_request_vote) and network::RpcType::AppendEntries (invoking handle_append_entries) occur before starting raft_thread_ and setting running_. In other words, register both handlers first, then set running_ = true and create raft_thread_ with &RaftNode::run_loop so there is no window where incoming messages could arrive without handlers installed.
src/main.cpp-259-265 (1)
259-265: ⚠️ Potential issue | 🟡 Minor
Potential payload truncation if response exceeds 64KB.
The `payload_len` field in `RpcHeader` is `uint16_t`, limiting payloads to 65535 bytes. The cast on line 261 (`static_cast<uint16_t>(resp_p.size())`) will silently truncate larger payloads, corrupting the protocol framing. For query results with many rows, this limit could be exceeded. Consider either validating the size before sending or extending the protocol to support larger payloads.
Add bounds check
```diff
 auto resp_p = reply.serialize();
+if (resp_p.size() > UINT16_MAX) {
+  // Handle oversized payload - truncate rows or return error
+  reply.success = false;
+  reply.error_msg = "Response too large";
+  reply.rows.clear();
+  resp_p = reply.serialize();
+}
 cloudsql::network::RpcHeader resp_h;
 resp_h.type = cloudsql::network::RpcType::QueryResults;
 resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 259 - 265, The RpcHeader.payload_len is a uint16_t and casting resp_p.size() to uint16_t in the RpcHeader creation (RpcHeader resp_h; resp_h.payload_len = static_cast<uint16_t>(resp_p.size());) will truncate payloads >65535; fix by validating resp_p.size() before encoding (e.g., if resp_p.size() > std::numeric_limits<uint16_t>::max() then return an error or use an alternate code path), or implement framing that supports larger payloads (e.g., extend RpcHeader.payload_len to a 32-bit field or send a multi-part/chunked response) and update the send logic that follows (send(fd, h_buf, 8, 0); send(fd, resp_p.data(), resp_p.size(), 0);) accordingly so no silent truncation occurs.
tests/server_tests.cpp-101-117 (1)
101-117: ⚠️ Potential issue | 🟡 Minor
Handshake test silently passes if connection fails.
If `connect()` fails on line 101, the entire handshake validation is skipped and the test passes. Consider using `ASSERT_EQ` to fail the test if connection fails.
Proposed fix to assert connection success
```diff
-if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
+ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)
+    << "Failed to connect to server";
+{
   // Send startup packet
   const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
                                            htonl(196608)};
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 101 - 117, The test currently uses if (connect(sock, ... ) == 0) { ... } so a failed connect silently skips assertions; change this to assert the connection succeeded using the test framework (e.g. replace the if with ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)) so the test fails immediately on connection error and then keep the existing startup/send/recv assertions (references: connect, sock, STARTUP_PKT_LEN, buffer).
🧹 Nitpick comments (13)
src/common/config.cpp (1)
67-73: Prefer explicit failure for unknown `mode` values.
Line 72 silently falls back to `Standalone` for unrecognized tokens, which can mask bad configs and start the wrong topology. Consider returning `false` on unknown mode strings.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/common/config.cpp` around lines 67 - 73, The current assignment silently maps any unrecognized string to RunMode::Standalone; change the logic in the function that parses 'value' so it only accepts the explicit tokens "distributed"/"coordinator" (set mode = RunMode::Coordinator) and "data" (set mode = RunMode::Data), and treat all other inputs as an error: do not set mode = RunMode::Standalone, instead log or report the invalid token and return false (or propagate failure) so the caller can reject the configuration; reference variables/functions: value, mode, and enum RunMode (RunMode::Coordinator, RunMode::Data, RunMode::Standalone) when implementing this validation change.
include/distributed/shard_manager.hpp (1)
9-12: Add direct standard headers used by this header.
This file uses `std::optional`, `std::nullopt`, and `std::hash` directly; add `<optional>` and `<functional>` to avoid transitive-include fragility.
Also applies to: 38-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 9 - 12, The header uses std::optional, std::nullopt and std::hash directly but relies on transitive includes; add explicit `#include` <optional> and `#include` <functional> at the top of include/distributed/shard_manager.hpp (alongside the existing <cstdint>, <string>, <vector>) so types like std::optional/std::nullopt and std::hash are always available for symbols used in this file (e.g. any declarations/uses in the ShardManager-related code).
tests/distributed_tests.cpp (1)
37-39: Set distributed mode explicitly in `DDLRouting` test setup.
Set `config.mode = config::RunMode::Coordinator` so this test can't silently exercise a non-distributed default path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 37 - 39, In the DDLRouting test setup, explicitly set the run mode on the config object to prevent the test from defaulting to a non-distributed path: after creating config::Config config, assign config.mode = config::RunMode::Coordinator before constructing ClusterManager cm(&config) and DistributedExecutor exec(*catalog, cm); this ensures the test uses the distributed Coordinator mode when invoking ClusterManager and DistributedExecutor.
CMakeLists.txt (1)
126-129: Pin googletest to a tag/commit for reproducible builds.
Using `refs/heads/main.zip` makes builds non-reproducible and increases supply-chain risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CMakeLists.txt` around lines 126 - 129, The FetchContent_Declare usage for googletest currently points to an unstable branch URL (URL https://github.com/google/googletest/archive/refs/heads/main.zip) which makes builds non-reproducible; update the FetchContent_Declare for the googletest dependency (the googletest declaration/URL) to reference a specific tag or commit (e.g., use the archive URL for a release tag or switch to GIT_REPOSITORY + GIT_TAG with a fixed tag/commit SHA) so the same version is fetched every build and supply-chain risk is reduced.
tests/raft_simulation_tests.cpp (1)
32-37: Add a post-timeout assertion in `FollowerToCandidate`.
After waiting on lines 32-33, there is no assertion, so this test can pass without verifying election behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` around lines 32 - 37, After the sleep in the FollowerToCandidate test, add an assertion that the node transitioned out of Follower (i.e., is Candidate or Leader). Locate the FollowerToCandidate test in tests/raft_simulation_tests.cpp and, immediately after the std::this_thread::sleep_for(...) call, assert the node's state using the project's state accessor (e.g., node->GetState() or raft_node.GetState()) with the testing macro (EXPECT_TRUE or ASSERT_TRUE) to check GetState() == State::Candidate || GetState() == State::Leader so the test verifies election behavior.
tests/raft_tests.cpp (1)
18-29: `StateTransitions` currently validates only the initial state.
After setup, the test only checks `EXPECT_FALSE(node.is_leader())` and does not assert any transition behavior. Either add a transition assertion or rename the test to reflect initial-state validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_tests.cpp` around lines 18 - 29, The test named TEST(RaftTests, StateTransitions) only verifies the initial state via RaftNode::is_leader() and does not exercise any state transitions; either rename the test to indicate it checks the initial state (e.g., "InitialState") or implement transition assertions: initialize the RaftNode ("node1", cm, rpc), trigger a transition path (for example start the node, advance timers or simulate an election timeout / append entries from peers using ClusterManager cm or network::RpcServer rpc hooks), then assert the expected state change via node.is_leader() (or other state-query methods) and any relevant side-effects; update the test name if you choose the rename approach so the test intent matches its claims.
tests/distributed_txn_tests.cpp (1)
105-106: Replace fixed sleeps with deterministic synchronization in assertions.
Using fixed `50ms` sleeps makes these tests timing-sensitive and flaky under slower CI hosts. Prefer waiting on a condition/poll-with-timeout tied to the atomic counters (or assert immediately if `execute()` is expected to block through phase 2).
Also applies to: 212-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 105 - 106, Replace the fixed std::this_thread::sleep_for(50ms) and EXPECT_EQ(commit_count.load(), 2) with deterministic polling or condition-wait: repeatedly check commit_count.load() (or wait on a std::condition_variable signaled by the code that increments commit_count) until it reaches the expected value or a reasonable timeout elapses, and then ASSERT/EXPECT that the count reached 2 (or fail the test on timeout); apply the same change for the other occurrence around execute() so the test no longer relies on a fixed 50ms sleep and instead deterministically waits for commit_count to update.
src/catalog/catalog.cpp (1)
95-100: Avoid persisting hardcoded shard endpoints in table metadata.
Writing `127.0.0.1:6432` for every new table bakes incorrect topology into catalog state outside single-node setups. This should come from actual cluster/shard assignment logic (or remain unset until assigned).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/catalog/catalog.cpp` around lines 95 - 100, The code is persisting a hardcoded shard endpoint by populating ShardInfo with shard.node_address = "127.0.0.1" and shard.port = 6432 and always pushing it into table->shards; remove this baked-in topology by not setting shard.node_address/shard.port to production values (leave them empty/zero) and avoid pushing the shard into table->shards until a real cluster assignment occurs. Locate the ShardInfo construction (symbols: ShardInfo, shard.node_address, shard.port, table->shards.push_back) in catalog.cpp and either omit adding the shard here or call the proper cluster/shard assignment routine (e.g., the cluster manager or AssignShard method) to populate and persist endpoint details only after a real assignment.
src/distributed/raft_node.cpp (2)
273-277: Use the `mutable` keyword instead of `const_cast`.
The `const_cast` on line 275 is a code smell indicating the design is inconsistent. Since `get_random_timeout()` is marked `const` but needs to modify `rng_`, the cleaner solution is to declare `rng_` as `mutable` in the header.
Proposed change in header (include/distributed/raft_node.hpp)
```diff
- std::mt19937 rng_;
+ mutable std::mt19937 rng_;
```
Then in the implementation:
```diff
 std::chrono::milliseconds RaftNode::get_random_timeout() const {
   std::uniform_int_distribution<int> dist(TIMEOUT_MIN_MS, TIMEOUT_MAX_MS);
-  auto& mutable_rng = const_cast<std::mt19937&>(rng_);
-  return std::chrono::milliseconds(dist(mutable_rng));
+  return std::chrono::milliseconds(dist(rng_));
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 273 - 277, The method RaftNode::get_random_timeout() is const but mutates rng_, so remove the const_cast by declaring the RNG member rng_ as mutable in the RaftNode class (in the header) and then simplify get_random_timeout() to use rng_ directly; update the declaration of rng_ (std::mt19937 rng_) to mutable std::mt19937 rng_ and remove the const_cast in get_random_timeout().
279-280: Missing state persistence breaks Raft safety guarantees.
The empty `persist_state()` and `load_state()` implementations mean that after a restart, a node could vote twice in the same term or lose track of committed entries. This violates Raft's safety properties.
While acceptable for a POC, this should be tracked as a high-priority TODO.
Would you like me to open an issue to track implementing persistent state storage for Raft durability?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 279 - 280, The RaftNode::persist_state() and RaftNode::load_state() stubs must be implemented to persist and recover Raft durable state: serialize and atomically write currentTerm, votedFor, and the stable prefix of the log (or at least the metadata needed for safety) in persist_state(), and read/validate/restore those values in load_state() at startup; ensure writes are atomic (write-to-temp+rename or use fsync) and call persist_state() whenever currentTerm or votedFor changes or entries are committed/installed (e.g., after Election, Vote handling, and AppendEntries apply paths), and have load_state() validate format/version/CRC and restore RaftNode::currentTerm, RaftNode::votedFor and log structures so the node cannot vote twice or lose committed entries.
src/main.cpp (2)
269-294: TxnPrepare handler ignores the transaction ID.
The `args` variable is deserialized but `args.txn_id` is never used. The comment on lines 277-278 acknowledges this is a simplified implementation. Consider adding a TODO comment referencing the transaction lookup requirement, or using `[[maybe_unused]]` to suppress warnings.
Mark unused variable explicitly
```diff
 (void)h;
-auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
+[[maybe_unused]] auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 269 - 294, The deserialized args (cloudsql::network::TxnOperationArgs args) in the TxnPrepare handler lambda is never used; either mark it explicitly unused or add a TODO to avoid confusion: inside the lambda for RpcType::TxnPrepare (the handler that currently calls log_manager->flush), annotate the unused variable with [[maybe_unused]] or replace it with [[maybe_unused]] auto args = ... and/or add a clear TODO comment referencing that the implementation should look up the transaction by args.txn_id and flush that txn's WAL (e.g., "TODO: use args.txn_id to find txn and flush its WAL"), keeping the rest of the handler logic unchanged.
228-266: Lambda captures references to stack-local objects.
The RPC handler lambdas capture `catalog`, `bpm`, `lock_manager`, `transaction_manager`, and `log_manager` by reference. These objects are declared earlier in `main()` and remain valid for the lifetime of the program, so the captures are safe in this context.
However, the `send()` calls on lines 264-265 don't check return values. If the client disconnects mid-response, partial writes could occur silently.
Consider checking send() return values
```diff
-send(fd, h_buf, 8, 0);
-send(fd, resp_p.data(), resp_p.size(), 0);
+if (send(fd, h_buf, 8, 0) < 0 ||
+    send(fd, resp_p.data(), resp_p.size(), 0) < 0) {
+  // Log or handle send failure
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 228 - 266, The RPC handler registered via rpc_server->set_handler writes the header and payload with send(fd, h_buf, 8, 0) and send(fd, resp_p.data(), resp_p.size(), 0) but ignores return values and thus doesn't handle partial writes or client disconnects; update the handler to check and handle send() results for both the encoded header (resp_h.encode -> h_buf) and the payload (resp_p.data(), resp_p.size()) by looping until all bytes are written (or an error/EINTR/EAGAIN occurs), logging failures via log_manager and closing or cleaning up the socket (fd) on unrecoverable errors; ensure you reference and protect the same symbols used here (rpc_server->set_handler lambda, resp_h.encode, h_buf, resp_p, send) when implementing the write-loop and error handling.
tests/server_tests.cpp (1)
38-41: Consider test isolation for shared data directory.
All tests use `"./test_data"` as the data directory. While these tests focus on server lifecycle rather than storage, shared state could cause flakiness. Consider using unique directories per test or a test fixture with setup/teardown.
Also applies to: 51-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 38 - 41, The tests instantiate StorageManager with a hardcoded "./test_data" which can create shared-state flakiness; change tests using StorageManager, storage::BufferPoolManager, and Server to use a unique temporary directory per test (e.g., construct a unique path via std::filesystem::temp_directory_path() + test-specific suffix or timestamp) and ensure cleanup in a test fixture SetUp()/TearDown() (or RAII helper) that creates the dir before creating StorageManager and removes it after the test completes to isolate state across tests.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- CMakeLists.txt
- include/catalog/catalog.hpp
- include/common/cluster_manager.hpp
- include/common/config.hpp
- include/distributed/distributed_executor.hpp
- include/distributed/raft_node.hpp
- include/distributed/raft_types.hpp
- include/distributed/shard_manager.hpp
- include/network/rpc_client.hpp
- include/network/rpc_message.hpp
- include/network/rpc_server.hpp
- include/network/server.hpp
- include/recovery/log_record.hpp
- include/transaction/transaction.hpp
- include/transaction/transaction_manager.hpp
- src/catalog/catalog.cpp
- src/common/config.cpp
- src/distributed/distributed_executor.cpp
- src/distributed/raft_node.cpp
- src/main.cpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- src/transaction/transaction_manager.cpp
- tests/distributed_tests.cpp
- tests/distributed_txn_tests.cpp
- tests/raft_simulation_tests.cpp
- tests/raft_tests.cpp
- tests/server_tests.cpp
```cmake
# Configure Sanitizers
if (NOT USE_SANITIZER STREQUAL "none")
  if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
    set(CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
```
🧩 Analysis chain
🏁 Script executed:
find . -name "CMakeLists.txt" -type f
Repository: poyrazK/cloudSQL
Length of output: 76
🏁 Script executed:
cat -n CMakeLists.txt
Repository: poyrazK/cloudSQL
Length of output: 7632
Remove the invalid set(CMAKE_CXX_COMPILER_ID MATCHES ...) assignment on line 23.
Line 23 uses invalid CMake syntax—the set() command does not accept the MATCHES operator. This corrupts CMAKE_CXX_COMPILER_ID and breaks downstream compiler checks on lines 39 and 87. The line is also redundant since it's already within the if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU") block from line 22.
Proposed fix
if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
- set(CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
string(REPLACE "," ";" SAN_LIST ${USE_SANITIZER})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CMakeLists.txt` at line 23, Remove the invalid assignment that uses the
MATCHES operator inside set(): delete the line "set(CMAKE_CXX_COMPILER_ID
MATCHES "Clang|GNU")" so CMAKE_CXX_COMPILER_ID is not overwritten; the correct
approach is to keep the surrounding "if (CMAKE_CXX_COMPILER_ID MATCHES
"Clang|GNU")" condition and remove the redundant set(...) call that breaks
downstream checks referencing CMAKE_CXX_COMPILER_ID (used later around the
existing checks on lines near 39 and 87).
```cpp
constexpr size_t BASE_SIZE = 24;
out.resize(BASE_SIZE + candidate_id.size());
std::memcpy(out.data(), &term, sizeof(term_t));
const uint64_t id_len = candidate_id.size();
std::memcpy(out.data() + 8, &id_len, 8);
std::memcpy(out.data() + 16, candidate_id.data(), id_len);
std::memcpy(out.data() + 16 + id_len, &last_log_index, sizeof(index_t));
std::memcpy(out.data() + 24 + id_len, &last_log_term, sizeof(term_t));
return out;
```
🧩 Analysis chain
🏁 Script executed:
# First, let me read the file to see the actual code at the specified lines
cat -n include/distributed/raft_types.hpp | head -100
Repository: poyrazK/cloudSQL
Length of output: 3123
🏁 Script executed:
# Search for the RequestVoteArgs class and serialize method
rg -A 20 "struct RequestVoteArgs" include/distributed/raft_types.hpp
Repository: poyrazK/cloudSQL
Length of output: 804
🏁 Script executed:
# Search for type definitions of index_t and term_t
rg "using (index_t|term_t)" include/distributed/raft_types.hpp
Repository: poyrazK/cloudSQL
Length of output: 110
RequestVoteArgs::serialize() under-allocates and causes out-of-bounds write.
Line 48 allocates 24 + id_len bytes, but line 54 writes 8 additional bytes at offset 24 + id_len, exceeding the buffer by 8 bytes. This is a critical memory corruption bug.
🔧 Proposed fix
[[nodiscard]] std::vector<uint8_t> serialize() const {
std::vector<uint8_t> out;
- constexpr size_t BASE_SIZE = 24;
- out.resize(BASE_SIZE + candidate_id.size());
- std::memcpy(out.data(), &term, sizeof(term_t));
- const uint64_t id_len = candidate_id.size();
- std::memcpy(out.data() + 8, &id_len, 8);
- std::memcpy(out.data() + 16, candidate_id.data(), id_len);
- std::memcpy(out.data() + 16 + id_len, &last_log_index, sizeof(index_t));
- std::memcpy(out.data() + 24 + id_len, &last_log_term, sizeof(term_t));
+ const uint64_t id_len = static_cast<uint64_t>(candidate_id.size());
+ constexpr size_t FIXED_SIZE =
+ sizeof(term_t) + sizeof(uint64_t) + sizeof(index_t) + sizeof(term_t);
+ out.resize(FIXED_SIZE + id_len);
+
+ size_t offset = 0;
+ std::memcpy(out.data() + offset, &term, sizeof(term_t));
+ offset += sizeof(term_t);
+ std::memcpy(out.data() + offset, &id_len, sizeof(uint64_t));
+ offset += sizeof(uint64_t);
+ std::memcpy(out.data() + offset, candidate_id.data(), id_len);
+ offset += id_len;
+ std::memcpy(out.data() + offset, &last_log_index, sizeof(index_t));
+ offset += sizeof(index_t);
+ std::memcpy(out.data() + offset, &last_log_term, sizeof(term_t));
return out;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
const uint64_t id_len = static_cast<uint64_t>(candidate_id.size());
constexpr size_t FIXED_SIZE =
    sizeof(term_t) + sizeof(uint64_t) + sizeof(index_t) + sizeof(term_t);
out.resize(FIXED_SIZE + id_len);

size_t offset = 0;
std::memcpy(out.data() + offset, &term, sizeof(term_t));
offset += sizeof(term_t);
std::memcpy(out.data() + offset, &id_len, sizeof(uint64_t));
offset += sizeof(uint64_t);
std::memcpy(out.data() + offset, candidate_id.data(), id_len);
offset += id_len;
std::memcpy(out.data() + offset, &last_log_index, sizeof(index_t));
offset += sizeof(index_t);
std::memcpy(out.data() + offset, &last_log_term, sizeof(term_t));
return out;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/distributed/raft_types.hpp` around lines 47 - 55, The buffer is
under-allocated in RequestVoteArgs::serialize(): BASE_SIZE (24) +
candidate_id.size() is too small because additional writes for last_log_index
and last_log_term occur after that, causing an OOB write. Fix by computing
total_size = /* header sizes */ + candidate_id.size() + sizeof(index_t) +
sizeof(term_t) (or simply resize to BASE_SIZE + candidate_id.size() +
sizeof(index_t) + sizeof(term_t)), update out.resize(total_size), and replace
hardcoded offset constants (8,16,24) with expressions using sizeof(term_t) and
candidate_id.size() so memcpy targets use correct offsets when writing term,
id_len, candidate_id, last_log_index, and last_log_term in
RequestVoteArgs::serialize().
```cpp
static uint32_t compute_shard(const common::Value& pk_value, uint32_t num_shards) {
    if (num_shards == 0) return 0;

    // Simple hash for demo purposes
    std::string s = pk_value.to_string();
    size_t hash = std::hash<std::string>{}(s);
    return static_cast<uint32_t>(hash % num_shards);
}
```
🧩 Analysis chain
🏁 Script executed:
cat -n include/distributed/shard_manager.hpp
Repository: poyrazK/cloudSQL
Length of output: 1696
🏁 Script executed:
# Check if <functional> is needed for std::hash
rg "std::hash" --type cpp
# Search for shard routing usage
rg "compute_shard" --type cpp
# Check if there are other hash implementations in the codebase
rg "FNV|fnv|deterministic.*hash|stable.*hash" --type cpp -i
# Verify if ShardInfo and TableInfo are defined
rg "struct ShardInfo|class ShardInfo" --type hpp
# Check for num_shards == 0 usage patterns
rg "num_shards.*0|0.*shard" --type cpp -A 2 -B 2
Repository: poyrazK/cloudSQL
Length of output: 1391
🏁 Script executed:
# Check the distributed_tests.cpp to see what the hash stability test expects
cat -n tests/distributed_tests.cpp | head -100
# Check if <functional> is included
rg "#include.*functional" include/distributed/shard_manager.hpp
# Look at Value::to_string() implementation
rg -A 10 "to_string()" include/common/value.hpp
# Check common/value.hpp hash implementation context
head -50 include/common/value.hpp
Repository: poyrazK/cloudSQL
Length of output: 4315
🏁 Script executed:
# Check if <functional> is included transitively
rg "#include" include/distributed/shard_manager.hpp
# Check what's in common/value.hpp that might pull in <functional>
rg "#include" include/common/value.hpp
# Check the actual distributed executor usage context
cat -n src/distributed/distributed_executor.cpp | grep -A 5 -B 5 "compute_shard"
Repository: poyrazK/cloudSQL
Length of output: 871
🏁 Script executed:
# Check if there's any runtime validation or assertions for num_shards
rg "num_shards|data_nodes.size()" src/distributed/distributed_executor.cpp -B 3 -A 3
# Verify if the Value class has its own stable hash implementation
grep -A 20 "struct Hash" include/common/value.hpp
# Check if the code is actually in production or marked as demo/WIP
rg "TODO|FIXME|demo|WIP" include/distributed/shard_manager.hpp
Repository: poyrazK/cloudSQL
Length of output: 1131
Replace std::hash<std::string> with deterministic hashing for distributed consistency.
Line 31 uses std::hash<std::string>, which is implementation-defined and produces different hashes across compiler versions, platforms, and standard library implementations. In a distributed system, the same key must consistently hash to the same shard regardless of which node performs the computation. The current approach can silently route identical keys to different shards across nodes.
The "demo purposes" comment (line 29) confirms this is intentionally temporary. Replace with a stable hash algorithm like FNV-1a (as proposed below) or another deterministic variant that produces consistent output across all builds and deployments.
Proposed fix
- // Simple hash for demo purposes
- std::string s = pk_value.to_string();
- size_t hash = std::hash<std::string>{}(s);
- return static_cast<uint32_t>(hash % num_shards);
+ const std::string s = pk_value.to_string();
+ uint64_t hash = 1469598103934665603ULL; // FNV-1a 64-bit
+ for (unsigned char c : s) {
+ hash ^= static_cast<uint64_t>(c);
+ hash *= 1099511628211ULL;
+ }
+ return static_cast<uint32_t>(hash % num_shards);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
static uint32_t compute_shard(const common::Value& pk_value, uint32_t num_shards) {
    if (num_shards == 0) return 0;
    const std::string s = pk_value.to_string();
    uint64_t hash = 1469598103934665603ULL;  // FNV-1a 64-bit
    for (unsigned char c : s) {
        hash ^= static_cast<uint64_t>(c);
        hash *= 1099511628211ULL;
    }
    return static_cast<uint32_t>(hash % num_shards);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/distributed/shard_manager.hpp` around lines 26 - 33, compute_shard
uses std::hash which is non-deterministic across platforms; replace it with a
deterministic algorithm (e.g., FNV-1a) so shard assignment is consistent across
nodes. In compute_shard(const common::Value& pk_value, uint32_t num_shards)
compute a stable 64-bit FNV-1a hash over the bytes of pk_value.to_string() (or
its canonical byte representation), then return static_cast<uint32_t>(fnv_hash %
num_shards); preserve the num_shards==0 early return and ensure the hashing uses
fixed-width types (uint64_t) and byte-wise processing to guarantee
cross-platform stability.
```cpp
static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
    QueryResultsReply reply;
    if (in.empty()) return reply;

    reply.success = in[0] != 0;

    uint32_t err_len;
    std::memcpy(&err_len, in.data() + 1, 4);
    if (in.size() >= 5 + err_len) {
        reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
    }

    uint32_t row_count;
    if (in.size() >= 9 + err_len) {
        std::memcpy(&row_count, in.data() + 5 + err_len, 4);
        reply.rows.resize(row_count);  // Placeholders
    }

    return reply;
}
```
Out-of-bounds read in QueryResultsReply::deserialize.
Line 126 reads 4 bytes at offset 1 (in.data() + 1) but only checks in.empty() beforehand. If in.size() is between 1 and 4, this causes undefined behavior. The bounds check on line 127 happens after the read.
Proposed fix to add bounds check before read

```diff
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
   QueryResultsReply reply;
-  if (in.empty()) return reply;
+  if (in.size() < 5) return reply; // Need at least: success(1) + err_len(4)
   reply.success = in[0] != 0;
   uint32_t err_len;
   std::memcpy(&err_len, in.data() + 1, 4);
+  err_len = ntohl(err_len); // Convert from network byte order
   if (in.size() >= 5 + err_len) {
     reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
   }
   uint32_t row_count;
-  if (in.size() >= 9 + err_len) {
+  if (in.size() >= 9 + err_len) { // success(1) + err_len(4) + err_msg + row_count(4)
     std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+    row_count = ntohl(row_count); // Convert from network byte order
     reply.rows.resize(row_count); // Placeholders
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 119 - 138, The deserialize
function QueryResultsReply::deserialize performs unguarded reads (std::memcpy)
for err_len and row_count causing potential OOB; before reading the 4-byte
err_len at in.data()+1 ensure in.size() >= 1 + 4, and before reading row_count
ensure in.size() >= 5 + err_len + 4; if the buffer is too small, return a
default/partial reply or set reply.success=false and reply.error_msg to an
explanatory message; adjust the checks around err_len and row_count (and any
subsequent reads) so every memcpy or pointer-based string construction is
preceded by a size check using in.size() and the computed err_len/row_count
values.
```cpp
std::thread accept_thread_;
std::vector<std::thread> worker_threads_;
std::unordered_map<RpcType, RpcHandler> handlers_;
std::mutex handlers_mutex_;
```
Add synchronization/state fields for safe worker lifecycle management.
Given concurrent accept_loop() and stop() behavior, the class needs dedicated synchronization (e.g., thread/client mutex and tracked client fds) to avoid races/hangs during shutdown.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_server.hpp` around lines 60 - 63, The class lacks
lifecycle synchronization for concurrent accept_loop() and stop(), leading to
races/hangs; add explicit state and synchronization fields (e.g., an
atomic<bool> running or stopping flag, a std::mutex clients_mutex, a
std::unordered_set<int> client_fds to track active clients, and a
std::condition_variable shutdown_cv) and use them in accept_loop(), worker
threads, and stop() to: (1) check/flip the running flag to prevent new accepts
after stop(), (2) protect modifications to client_fds with clients_mutex, (3)
notify/wait on shutdown_cv to join worker_threads_ cleanly, and (4) guard
handlers_ using handlers_mutex_ where handlers are modified or invoked; update
accept_thread_, worker_threads_, accept_loop(), and stop() to consult these new
fields to ensure orderly shutdown.
```cpp
prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
    network::RpcClient client(node.address, node.cluster_port);
    if (client.connect()) {
        std::vector<uint8_t> resp_payload;
        if (client.call(network::RpcType::TxnPrepare, payload, resp_payload)) {
            auto reply = network::QueryResultsReply::deserialize(resp_payload);
            if (reply.success) return std::make_pair(true, std::string(""));
            return std::make_pair(false, "[" + node.id + "] Prepare failed: " + reply.error_msg);
        }
        return std::make_pair(false, "[" + node.id + "] RPC failed during prepare");
    }
    return std::make_pair(false, "[" + node.id + "] Connection failed during prepare");
}));
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check the file structure and read the relevant sections
wc -l src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 104

🏁 Script executed:

```bash
#!/bin/bash
# Read lines 50-130 to see the loop context and all three locations mentioned
sed -n '50,130p' src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 3412

🏁 Script executed:

```bash
#!/bin/bash
# Verify the exact pattern: std::async with [&node] captures
rg -n 'std::async.*\[&node' src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 352
Async tasks capture loop variable node by reference, creating dangling references.
In the TransactionCommit (lines 60, 90), and TransactionRollback (line 115) handlers, the async lambdas capture node by reference within the for (const auto& node : data_nodes) loop. Since the futures are stored and awaited after the loop completes, the reference becomes invalid when the loop variable goes out of scope.
✅ Capture `node` by value

```diff
-    prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+    prepare_futures.push_back(std::async(std::launch::async, [node, payload]() {
@@
-    phase2_futures.push_back(std::async(std::launch::async, [&node, payload, phase2_type]() {
+    phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() {
@@
-    rollback_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+    rollback_futures.push_back(std::async(std::launch::async, [node, payload]() {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/distributed/distributed_executor.cpp` around lines 60 - 72, The async
lambda captures the loop variable node by reference which leads to dangling
references when futures outlive the loop; update the capture in the
prepare_futures (and analogous TransactionCommit and TransactionRollback) async
lambdas to capture node by value (e.g., capture node instead of &node or use
node = node) so each future holds its own copy of the node and avoid
use-after-scope issues when awaiting the stored futures.
```cpp
bool RpcClient::call(RpcType type, const std::vector<uint8_t>& payload,
                     std::vector<uint8_t>& response_out) {
    if (!send_only(type, payload)) {
        return false;
    }

    char header_buf[8];
    if (recv(fd_, header_buf, 8, 0) <= 0) {
        return false;
    }

    const RpcHeader resp_header = RpcHeader::decode(header_buf);
    response_out.resize(resp_header.payload_len);
    if (resp_header.payload_len > 0) {
        static_cast<void>(recv(fd_, response_out.data(), resp_header.payload_len, 0));
    }

    return true;
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
head -100 src/network/rpc_client.cpp | cat -n
```

Repository: poyrazK/cloudSQL
Length of output: 3097

🏁 Script executed:

```bash
find . -type f -name "*.h" | xargs grep -l "class RpcClient" | head -1
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
head -150 src/network/rpc_client.cpp | cat -n
```

Repository: poyrazK/cloudSQL
Length of output: 3138
Add loops to handle TCP partial sends/receives and protect recv calls with mutex lock.
The recv() calls on lines 63 and 70 don't hold the mutex and can return fewer bytes than requested. This causes two critical issues:
- Partial transfers not handled: TCP can return fewer bytes than requested; the code treats any non-negative return as success, causing data loss or truncation.
- Interleaved responses: Multiple threads can call `call()` concurrently. After `send_only()` releases the lock (line 58), another thread can send its request before the first thread's `recv()` (line 63) completes, causing response mixing.
The same issue affects send() on lines 89 and 93—partial sends are not retried.
Wrap recv/send calls in loops that continue until all bytes are transferred or an error occurs. Extend the mutex lock scope in call() to cover the entire recv sequence (lines 63-70).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` around lines 56 - 74, The RpcClient::call
implementation must handle partial TCP transfers and prevent interleaved
responses by extending the mutex lock around the entire receive sequence and
retrying partial recv/write until all bytes are transferred or an error occurs:
inside RpcClient::call, after send_only(...) returns, acquire the connection
mutex and perform a looped recv on fd_ to read exactly 8 header bytes (feeding
into header_buf) and then decode with RpcHeader::decode, resize response_out,
and looped recv until resp_header.payload_len bytes are fully read into
response_out; similarly, update the send(...) implementation (and any send-only
helper used by send_only) to loop until the full buffer is written (handling
partial returns and errors) so that both send and recv always transfer the
requested byte counts and the call() mutex covers the entire receive phase to
avoid interleaving between threads.
```cpp
const std::scoped_lock<std::mutex> lock(mutex_);
if (fd_ < 0 && !connect()) {
    return false;
```
🧩 Analysis chain
🏁 Script executed:

```bash
# First, let's look at the send_only() and connect() implementations
rg -A 20 "bool RpcClient::send_only" src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 627

🏁 Script executed:

```bash
# Also check the connect() method implementation
rg -A 20 "bool RpcClient::connect()" src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 626

🏁 Script executed:

```bash
# Check the mutex_ declaration and type
rg "mutex_" src/network/rpc_client.cpp | head -20
```

Repository: poyrazK/cloudSQL
Length of output: 218

🏁 Script executed:

```bash
# Let's also check the header file to see class definition
fd "rpc_client.h" src/network/
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Find and read the rpc_client.h header
rg -l "rpc_client" src/network/ | grep "\.h"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Read the class definition and member variables
rg -B 5 -A 10 "std::mutex.*mutex_|mutex_.*std::mutex" src/network/rpc_client.h
```

Repository: poyrazK/cloudSQL
Length of output: 124

🏁 Script executed:

```bash
# Search for the header file more broadly
fd -e h "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Also check the actual directory structure of src/network
ls -la src/network/
```

Repository: poyrazK/cloudSQL
Length of output: 369

🏁 Script executed:

```bash
# Check if there's a header file somewhere else in the repo
fd -e h "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Look at the beginning of rpc_client.cpp for class definition or includes
head -50 src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 1180

🏁 Script executed:

```bash
# Find the rpc_client.hpp header file
fd -e hpp "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 90

🏁 Script executed:

```bash
# Read the header file to confirm mutex_ type
cat -n include/network/rpc_client.hpp
```

Repository: poyrazK/cloudSQL
Length of output: 1464
Fix self-deadlock in send_only() when reconnecting.
When fd_ is negative, line 78 calls connect(), which attempts to lock mutex_ (line 43 in header: std::mutex mutex_). However, the same mutex is already held by std::scoped_lock at line 77. Since std::mutex is non-recursive, this causes a deadlock.
The proposed fix using std::unique_lock with explicit unlock/lock around the connect() call resolves this while maintaining thread safety (the fd_ check in connect() prevents race conditions).
Proposed fix

```diff
-bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
-  const std::scoped_lock<std::mutex> lock(mutex_);
-  if (fd_ < 0 && !connect()) {
-    return false;
-  }
+bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
+  std::unique_lock<std::mutex> lock(mutex_);
+  if (fd_ < 0) {
+    lock.unlock();
+    if (!connect()) {
+      return false;
+    }
+    lock.lock();
+  }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (fd_ < 0) {
        lock.unlock();
        if (!connect()) {
            return false;
        }
        lock.lock();
    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` around lines 77 - 79, The send_only() function
holds mutex_ via std::scoped_lock and then calls connect(), but connect() also
locks mutex_, causing a self-deadlock; change send_only() to use
std::unique_lock<std::mutex> lock(mutex_) so you can temporarily unlock before
calling connect() and re-lock afterward (i.e., check fd_, if fd_ < 0 then
lock.unlock(); if (!connect()) return false; lock.lock();), preserving the fd_
checks and thread-safety around fd_ and using the existing mutex_ and connect()
implementation.
src/network/rpc_server.cpp
Outdated
```cpp
const ssize_t n = recv(client_fd, header_buf, 8, 0);
if (n <= 0) {
    break;
}

const RpcHeader header = RpcHeader::decode(header_buf);
std::vector<uint8_t> payload(header.payload_len);
if (header.payload_len > 0) {
    static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, verify the file exists and check the specific lines mentioned
echo "=== Checking file existence and line count ==="
wc -l src/network/rpc_server.cpp
echo -e "\n=== Reading lines 90-110 for context ==="
sed -n '90,110p' src/network/rpc_server.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 859

🏁 Script executed:

```bash
#!/bin/bash
# Run the provided verification script to find recv() patterns
rg -nP 'recv\(client_fd,\s*header_buf,\s*8|recv\(client_fd,\s*payload\.data\(\),\s*header\.payload_len' src/network/rpc_server.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 214
Fix incomplete recv() calls that can corrupt RPC framing.
TCP reads are not guaranteed to return the requested number of bytes. Line 95 only checks if (n <= 0) but does not verify that exactly 8 bytes were received; a partial header will be decoded incorrectly. Line 103 ignores the payload recv() return value entirely, silently accepting incomplete reads and causing protocol desynchronization on the next frame.
Implement a loop (e.g., recv_all()) to guarantee complete reads for both the 8-byte header and full payload:
Suggested fix pattern

```diff
+namespace {
+ssize_t recv_all(int fd, void* buf, size_t len) {
+  size_t total = 0;
+  auto* out = static_cast<char*>(buf);
+  while (total < len) {
+    const ssize_t n = recv(fd, out + total, len - total, 0);
+    if (n <= 0) return n;
+    total += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(total);
+}
+} // namespace
@@
-  const ssize_t n = recv(client_fd, header_buf, 8, 0);
-  if (n <= 0) {
+  const ssize_t n = recv_all(client_fd, header_buf, sizeof(header_buf));
+  if (n != static_cast<ssize_t>(sizeof(header_buf))) {
     break;
   }
@@
   std::vector<uint8_t> payload(header.payload_len);
   if (header.payload_len > 0) {
-    static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
+    if (recv_all(client_fd, payload.data(), header.payload_len) !=
+        static_cast<ssize_t>(header.payload_len)) {
+      break;
+    }
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_server.cpp` around lines 95 - 104, The current recv usage can
return partial reads and corrupt RPC framing; implement a blocking helper like
recv_all(int fd, void* buf, size_t len) that loops until len bytes are read or
an error/EOF occurs, and use it to read the 8-byte header into header_buf
(before calling RpcHeader::decode) and to read header.payload_len bytes into
payload.data(); ensure you check the return of recv_all and handle short/failed
reads by closing the connection or breaking the loop (use the same client_fd,
header_buf, payload, RpcHeader::decode symbols to locate where to replace the
single recv calls).
```cpp
void TransactionManager::commit(Transaction* txn) {
    if (txn->get_state() == TransactionState::COMMITTED) {
        return;
    }
```
Prevent invalid terminal-state transitions in commit() and abort().
Current guards allow committing an already-aborted transaction and aborting an already-committed transaction, which can corrupt transaction state/history.
✅ Suggested guard hardening

```diff
 void TransactionManager::commit(Transaction* txn) {
+  if (txn == nullptr) {
+    return;
+  }
   if (txn->get_state() == TransactionState::COMMITTED) {
     return;
   }
+  if (txn->get_state() == TransactionState::ABORTED) {
+    return;
+  }
@@
 void TransactionManager::abort(Transaction* txn) {
-  if (txn->get_state() == TransactionState::ABORTED) {
+  if (txn == nullptr) {
+    return;
+  }
+  if (txn->get_state() == TransactionState::ABORTED ||
+      txn->get_state() == TransactionState::COMMITTED) {
     return;
   }
-  /* Undo all changes if not already committed */
-  if (txn->get_state() != TransactionState::COMMITTED) {
-    undo_transaction(txn);
-  }
+  /* Undo all changes */
+  undo_transaction(txn);
```

Also applies to: 115-117, 119-142
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transaction/transaction_manager.cpp` around lines 75 - 78, The commit()
and abort() methods must reject transitions from one terminal state to the
other: in TransactionManager::commit(Transaction* txn) add a guard that if
txn->get_state() == TransactionState::ABORTED then do not proceed (log/return or
throw) so an aborted txn cannot be committed; likewise in
TransactionManager::abort(Transaction* txn) add a guard that if txn->get_state()
== TransactionState::COMMITTED then do not proceed so a committed txn cannot be
aborted. Use the existing Transaction::get_state() and TransactionState enum
values (COMMITTED, ABORTED) to implement these checks and ensure no
state/history mutation occurs when a terminal-state conflict is detected.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/server_tests.cpp (1)
1-124: ⚠️ Potential issue | 🟡 Minor

`clang-format` CI failure is still open for this file. GitHub Actions reports formatting violations here. Please run `clang-format -i tests/server_tests.cpp` and commit the result so the formatting gate passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 1 - 124, Run clang-format -i on the test file that contains the ServerTests (tests with TEST(ServerTests, StatusStrings), TEST(ServerTests, Lifecycle) and TEST(ServerTests, Handshake)) to fix the formatting violations reported by CI, then stage and commit the reformatted file so the formatting gate passes; no code changes besides formatting should be made.
♻️ Duplicate comments (8)
include/network/rpc_message.hpp (1)
120-135: ⚠️ Potential issue | 🔴 Critical

Fix unsafe and non-portable deserialization in `QueryResultsReply`. Line 127 copies 4 bytes without first ensuring `in.size() >= 5`, which is an OOB read risk. Also, `err_len`/`row_count` should be normalized with network byte order for cross-node portability.

Suggested fix

```diff
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
   QueryResultsReply reply;
-  if (in.empty()) return reply;
+  if (in.size() < 5) return reply; // success(1) + err_len(4)
   reply.success = in[0] != 0;
-  uint32_t err_len;
+  uint32_t err_len = 0;
   std::memcpy(&err_len, in.data() + 1, 4);
+  err_len = ntohl(err_len);
   if (in.size() >= 5 + err_len) {
     reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
   }
-  uint32_t row_count;
+  uint32_t row_count = 0;
   if (in.size() >= 9 + err_len) {
     std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+    row_count = ntohl(row_count);
     reply.rows.resize(row_count);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 120 - 135, In QueryResultsReply::deserialize, avoid the OOB read and non-portable integer parsing by first verifying in.size() >= 5 before copying err_len and then verifying in.size() >= 5 + err_len + 4 before reading row_count; replace the raw memcpy interpretation of err_len and row_count with network-byte-order-aware conversion (e.g., read uint32_t from the byte stream then apply ntohl/ntohl-equivalent) so values are normalized across hosts; update the checks around reply.error_msg and reply.rows resizing to use these validated, converted lengths and bail out or throw if sizes would overflow or exceed in.size() bounds.

src/network/rpc_server.cpp (1)
93-102: ⚠️ Potential issue | 🔴 Critical

Use `recv_all` for header/payload to avoid frame corruption. Line 93 and Line 101 assume one `recv()` returns all requested bytes. TCP does not guarantee that, so header/payload can be truncated and misparsed.

Suggested fix pattern

```diff
+namespace {
+ssize_t recv_all(int fd, void* buf, size_t len) {
+  size_t off = 0;
+  auto* p = static_cast<char*>(buf);
+  while (off < len) {
+    const ssize_t n = recv(fd, p + off, len - off, 0);
+    if (n <= 0) return n;
+    off += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(off);
+}
+} // namespace
@@
-  const ssize_t n = recv(client_fd, header_buf, 8, 0);
-  if (n <= 0) {
+  const ssize_t n = recv_all(client_fd, header_buf, sizeof(header_buf));
+  if (n != static_cast<ssize_t>(sizeof(header_buf))) {
     break;
   }
@@
-  static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
+  if (recv_all(client_fd, payload.data(), header.payload_len) !=
+      static_cast<ssize_t>(header.payload_len)) {
+    break;
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_server.cpp` around lines 93 - 102, The recv calls reading the RPC frame are assuming full reads; replace them with a robust "recv_all" style loop to ensure exactly 8 bytes are read into header_buf and then exactly header.payload_len bytes into payload (or return/error if peer closes/short read). Update the code paths around header_buf, RpcHeader::decode(header_buf), payload and client_fd to call/use recv_all(client_fd, header_buf, 8) and recv_all(client_fd, payload.data(), header.payload_len) (or equivalent helper that returns bytes read or an error) and handle partial reads/EOF by breaking/closing the connection before attempting to decode or process a truncated frame.

include/distributed/shard_manager.hpp (1)
26-33: ⚠️ Potential issue | 🔴 Critical

Use deterministic hashing for shard selection. Line 31 uses `std::hash<std::string>`, which is not stable across implementations. That can route the same key to different shards on different nodes/builds.

Suggested fix (deterministic FNV-1a 64-bit)

```diff
-    // Simple hash for demo purposes
-    std::string s = pk_value.to_string();
-    size_t hash = std::hash<std::string>{}(s);
-    return static_cast<uint32_t>(hash % num_shards);
+    const std::string s = pk_value.to_string();
+    uint64_t hash = 14695981039346656037ULL; // FNV-1a offset basis
+    for (unsigned char c : s) {
+      hash ^= static_cast<uint64_t>(c);
+      hash *= 1099511628211ULL; // FNV prime
+    }
+    return static_cast<uint32_t>(hash % num_shards);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 26 - 33, compute_shard currently uses std::hash<std::string> which is non-deterministic across builds; replace it with a deterministic FNV-1a 64-bit hash implementation inside compute_shard (use the bytes of pk_value.to_string() or pk_value.data() if available), compute the 64-bit FNV-1a hash, then take hash % num_shards and return as uint32_t; ensure the function still handles num_shards==0 and casts the 64-bit result safely to uint32_t when returning.

include/network/rpc_server.hpp (1)
54-65: ⚠️ Potential issue | 🔴 Critical

Add explicit client-lifecycle shutdown state to prevent stop() hangs. The class still lacks client-fd tracking and shutdown coordination fields, so `stop()` cannot proactively unblock `handle_client()` threads waiting in `recv()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_server.hpp` around lines 54 - 65, Add explicit client-lifecycle shutdown state and coordination to allow stop() to unblock threads in recv(): introduce a thread-safe container for active client fds (e.g., std::unordered_set<int> client_fds_ with std::mutex client_fds_mutex_), an atomic<bool> shutting_down_{false} (or similar) and a std::condition_variable shutdown_cv_ to signal shutdown. Update accept_loop() to stop accepting when shutting_down_ is set; register/deregister fds in handle_client() using client_fds_ and periodically check shutting_down_ (or rely on shutdown_cv_) so the loop can exit; implement stop() to set shutting_down_, lock client_fds_mutex_, close all tracked fds to unblock recv(), notify shutdown_cv_, and join worker/accept threads. Ensure handlers_ synchronization remains unchanged.

src/network/rpc_client.cpp (2)
58-73: ⚠️ Potential issue | 🔴 Critical

Handle partial TCP transfers and keep response receive serialized per call. Single `send()`/`recv()` calls are not guaranteed to transfer full buffers, and `call()` reads the response outside the lock, so concurrent calls can interleave responses on the same socket.

Also applies to: 91-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 58 - 73, RpcClient::call currently assumes single send()/recv() complete transfers and reads the response outside any synchronization, which can cause partial reads/writes and interleaved responses across concurrent calls; fix by (1) implementing and using robust readn/writeN helpers (e.g., readFull/writeFull) that loop until the requested byte count is transferred or an error/EOF occurs, and replace the direct recv(fd_, ...) calls in call (and the similar block at the other occurrence) with readFull to fully read the 8-byte RpcHeader and then the full payload; and (2) serialize per-call socket use by protecting the entire send+receive sequence in RpcClient::call (including the send_only invocation or replace it with writeFull) with a mutex (e.g., a member std::mutex) so concurrent calls cannot interleave on fd_. Ensure errors from partial transfers propagate as failures from call().
79-81: ⚠️ Potential issue | 🔴 Critical

Fix self-deadlock in `send_only()` reconnect path. Line 80 calls `connect()` while `mutex_` is already held (Line 79), and `connect()` also locks `mutex_`. With `std::mutex`, this can deadlock.

Suggested fix
```diff
-bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
-  const std::scoped_lock<std::mutex> lock(mutex_);
-  if (fd_ < 0 && !connect()) {
-    return false;
-  }
+bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
+  std::unique_lock<std::mutex> lock(mutex_);
+  if (fd_ < 0) {
+    lock.unlock();
+    if (!connect()) {
+      return false;
+    }
+    lock.lock();
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 79 - 81, send_only() currently holds std::scoped_lock<std::mutex> lock(mutex_) and then calls connect(), but connect() also locks mutex_, causing a self-deadlock; fix by checking fd_ under the scoped_lock, then release the lock before invoking connect() (e.g., drop the scoped_lock or limit its scope) and after connect() re-acquire the mutex to re-check fd_/state as needed; reference the send_only() function, the connect() method, the mutex_ member and fd_ variable when making this change.

src/distributed/distributed_executor.cpp (2)
60-73: ⚠️ Potential issue | 🔴 Critical

Async tasks capture loop variable `node` by reference. Each future may run after the loop advances, so `&node` is unsafe here.

✅ Capture `node` by value

```diff
-  prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+  prepare_futures.push_back(std::async(std::launch::async, [node, payload]() {
@@
-  std::async(std::launch::async, [&node, payload, phase2_type]() {
+  std::async(std::launch::async, [node, payload, phase2_type]() {
@@
-  rollback_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+  rollback_futures.push_back(std::async(std::launch::async, [node, payload]() {
```

```bash
#!/bin/bash
# Verify no async lambda still captures loop variable `node` by reference.
rg -nP 'std::async\([^)]*\[\s*&node' src/distributed/distributed_executor.cpp
# Expected: no matches
```

Also applies to: 92-99, 117-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 60 - 73, Loop-captured variable `node` is captured by reference in the std::async lambdas (e.g., the lambda pushed into prepare_futures), which is unsafe; change the lambda captures to capture node by value (e.g., replace [&node, payload] with [node, payload]) for the prepare_futures async call and do the same for the other async lambdas referenced (the similar std::async calls around lines 92-99 and 117-125) so each future holds its own copy of node instead of referencing a loop variable.
32-37: ⚠️ Potential issue | 🔴 Critical

DDL path still returns success without applying any change.
This silently acknowledges DDL while doing nothing.
🚫 Safer interim behavior
```diff
     if (type == parser::StmtType::CreateTable || type == parser::StmtType::DropTable ||
         type == parser::StmtType::CreateIndex || type == parser::StmtType::DropIndex) {
-        // These are handled by Raft via the Catalog locally on the leader
-        // and replicated to followers.
-        return QueryResult(); // Default is success
+        QueryResult res;
+        res.set_error("Distributed DDL execution is not implemented");
+        return res;
     }
```
🟡 Minor comments (5)
src/network/rpc_client.cpp (1)

1-103: ⚠️ Potential issue | 🟡 Minor

Formatting check is failing for this file. CI reports `clang-format` failure; reformat this file to satisfy pipeline checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 1 - 103, The file fails clang-format; reformat the RpcClient implementation to match the project's formatting rules (e.g., run clang-format with the repo's config) so indentation, spacing, and brace placement are correct across the RpcClient class and its methods (RpcClient::connect, RpcClient::~RpcClient, RpcClient::disconnect, RpcClient::call, RpcClient::send_only), ensuring includes, namespace closing, and function signatures follow the configured style.

src/network/rpc_server.cpp (1)
1-119: ⚠️ Potential issue | 🟡 Minor

Formatting check is failing for this file. CI reports `clang-format` failure; please run formatting on this translation unit to unblock the pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_server.cpp` around lines 1 - 119, This file fails clang-format; reformat the translation unit (rpc_server.cpp) to match project style by running clang-format (or your project's format command) and committing the updated file so CI passes—apply formatting across the entire file including functions RpcServer::start, RpcServer::stop, RpcServer::set_handler, RpcServer::accept_loop and RpcServer::handle_client, then re-run CI to ensure the clang-format check succeeds.src/network/server.cpp-95-107 (1)
95-107: ⚠️ Potential issue | 🟡 Minor — Formatting check is failing for this file in CI.
Please run clang-format -i on this file to unblock the pipeline.
Also applies to: 122-122, 261-261, 268-268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/server.cpp` around lines 95 - 107, Run clang-format -i on src/network/server.cpp to fix the CI formatting failures; reformat the file so the constructor Server::Server(...) initializer list and the factory Server::create(...) follow the project's clang-format rules (ensure lines around Server::Server, Server::create, and the regions mentioned at lines ~122, ~261, ~268 are reformatted). Commit the reformatted file so CI formatting checks pass.
tests/distributed_txn_tests.cpp-36-39 (1)
36-39: ⚠️ Potential issue | 🟡 Minor — Missing null checks before dereferencing parsed statements.
Please assert parser success before *stmt in each test.
🧪 Proposed fix pattern
```diff
 Parser parser(std::move(lexer));
 auto stmt = parser.parse_statement();
+ASSERT_NE(stmt, nullptr);
 auto res = exec.execute(*stmt, "COMMIT");
```
Also applies to: 101-104, 209-212
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 36 - 39, The tests dereference the parsed statement pointer (*stmt) without asserting parsing succeeded; add a null-check assertion (e.g., ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr)) immediately after calling parser.parse_statement() before using *stmt in calls like exec.execute(*stmt, "COMMIT") (apply the same fix for the other occurrences around the blocks with parser.parse_statement() at the noted locations).
tests/distributed_tests.cpp-43-46 (1)
43-46: ⚠️ Potential issue | 🟡 Minor — Guard parser output before dereference.
Add ASSERT_NE(stmt, nullptr) before *stmt.
🧪 Proposed fix
```diff
 Parser parser(std::move(lexer));
 auto stmt = parser.parse_statement();
+ASSERT_NE(stmt, nullptr);
 auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 43 - 46, The test dereferences the result of parser.parse_statement() without checking for null; add a guard asserting the pointer is non-null before dereference by inserting ASSERT_NE(stmt, nullptr) after auto stmt = parser.parse_statement(), so that the subsequent call exec.execute(*stmt, "CREATE TABLE test (id INT)") is safe and the test fails clearly if parsing returns null.
🧹 Nitpick comments (4)
include/distributed/distributed_executor.hpp (1)
1-5: Clean up duplicated doc-comment opener. Line 2 repeats `/**` inside the existing comment block; removing it improves header clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/distributed_executor.hpp` around lines 1 - 5, The file-level doc comment at the top of distributed_executor.hpp contains a duplicated opener ("/*" or "/**") on the second line; remove the extra "/**" so the header comment is a single well-formed block comment describing the file (leaving the existing lines with "@file distributed_executor.hpp" and "@brief High-level executor for distributed queries" intact).
include/distributed/shard_manager.hpp (1)
9-12: Make the header self-contained with direct std includes. This file uses std::optional and std::hash but does not include <optional> and <functional> directly. Add them explicitly to avoid transitive-include fragility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 9 - 12, The header is not self-contained: it uses std::optional and std::hash but only includes <string>, <vector>, <cstdint>; add direct includes for <optional> and <functional> at the top of include/distributed/shard_manager.hpp so symbols like std::optional and std::hash resolve without relying on transitive includes (ensure the new includes appear alongside the existing `#include` lines).
tests/distributed_txn_tests.cpp (1)
107-109: Avoid fixed sleeps for synchronization in deterministic unit tests. These waits are unnecessary if execute() already blocks until async phases complete.
Also applies to: 218-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 107 - 109, Remove the fixed std::this_thread::sleep_for(...) calls and rely on execute() to block until async phases complete; specifically delete the sleep before the EXPECT_EQ(commit_count.load(), 2) check (and the analogous sleep at the other occurrence) so the test asserts immediately after txn.execute() or, if execute() cannot be relied on, replace the sleep with a proper wait on commit_count using a small-timeout loop or condition_variable tied to commit_count rather than a fixed sleep. Ensure references to commit_count and execute() remain the synchronization points in the test.
tests/raft_simulation_tests.cpp (1)
33-38: FollowerToCandidate has no assertion after the election window. Right now it cannot fail on missing transition behavior. Please add an observable post-timeout expectation (or expose state for tests).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` around lines 33 - 38, The test FollowerToCandidate currently only sleeps and never asserts the node's post-election state; add a concrete expectation after the sleep that verifies the node transitioned (e.g., assert node->getState() == State::Candidate || node->getState() == State::Leader). If no accessor exists, expose the state for tests by adding a read-only accessor like RaftNode::getState() (or make the state member test-visible) and use that in the test assertion to fail when the transition did not occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/network/rpc_client.hpp`:
- Line 27: The is_connected() accessor reads fd_ without synchronization causing
a data race against connect() and disconnect() which hold mutex_; fix by
synchronizing access: either acquire the same mutex_ inside is_connected() (lock
mutex_ and return fd_ >= 0) or change fd_ to std::atomic<int> and update
connect()/disconnect() accordingly; reference is_connected(), connect(),
disconnect(), fd_, and mutex_ to locate the changes and ensure all reads/writes
use the chosen synchronization method.
In `@include/network/rpc_message.hpp`:
- Around line 148-157: The code serializes txn_id in TxnOperationArgs by
memcpy’ing host-endian uint64_t directly; update TxnOperationArgs::serialize to
convert txn_id to network byte order with htonll() before writing into the
output buffer and update TxnOperationArgs::deserialize to read the 8 bytes then
convert back with ntohll() into args.txn_id; apply the same pattern to other
numeric fields mentioned (e.g., QueryResultsReply::err_len and row_count) so all
multi-byte integer fields use htonll/ntohll (or appropriate hton*/ntoh* for
sizes) when serializing/deserializing.
In `@src/catalog/catalog.cpp`:
- Around line 75-82: When raft_node_ is set the code currently calls
create_table_local and returns local success without replicating via Raft;
change Catalog::create_table so that if raft_node_ != nullptr it does not
perform local-only DDL but instead returns/throws an explicit "not supported in
raft mode" error (or logs and aborts the operation) until the Raft
serialization/replication is implemented, rather than calling
create_table_local; apply the same change pattern to the other DDL paths that
currently bypass replication (e.g., any other methods that check raft_node_ and
call create_*_local or drop_*_local).
- Around line 95-100: The code currently hard-codes a default shard to localhost
by creating a ShardInfo with shard.shard_id = 0, shard.node_address =
"127.0.0.1", shard.port = 6432 and pushing it into table->shards; replace this
static assignment with a configurable or discovery-backed default: read the
shard address/port from configuration or environment (or call an existing helper
like getLocalAddress()/getClusterConfig() if available) and populate ShardInfo
from those values, falling back to a safe sentinel (e.g., empty node_address or
INADDR_ANY/0) rather than "127.0.0.1"; keep the same symbols (ShardInfo,
shard.shard_id, shard.node_address, shard.port, table->shards.push_back) so
changes are localized.
In `@src/distributed/distributed_executor.cpp`:
- Around line 47-55: The code incorrectly hardcodes GLOBAL_TXN_ID = 1 causing
cross-wiring of independent transactions; instead remove the constexpr and get
the real transaction id for this operation (e.g., from the execution context or
transaction manager) and assign it to network::TxnOperationArgs::txn_id before
serialize()—for example, obtain currentTxnId via your
TransactionManager/ExecutionContext API (or call a
TransactionIdGenerator/BeginTxn helper) and use that id where
parser::StmtType::TransactionCommit is handled so each commit/rollback uses the
correct unique txn id rather than GLOBAL_TXN_ID.
- Around line 86-107: Phase-2 RPC failures are currently ignored: when sending
TxnCommit via phase2_futures the result of network::RpcClient::call is discarded
so all_prepared can still return success even if some commits failed; update the
phase-2 loop to collect and enforce outcomes by having each async task return a
status (bool or error string) instead of void (e.g., change phase2_futures to
std::vector<std::future<ResultType>> and have the lambda capture
phase2_type/payload and return success/failure based on client.call), then after
joining examine all futures and if any commit failed set QueryResult to an error
(using res.set_error(...)) and return failure; ensure you reference phase2_type,
phase2_futures, network::RpcClient, client.call, all_prepared, and QueryResult
when making the change.
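One way to make phase-2 outcomes observable, sketched with a hypothetical run_phase2 helper; `CommitRpc` stands in for the real `network::RpcClient::call` wrapper and is not the project's actual API:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical RPC stub: returns true on success. In the real code this
// would wrap the TxnCommit/TxnAbort call on network::RpcClient.
using CommitRpc = std::function<bool(const std::string& node)>;

// Phase 2 of 2PC: fan out commits and require every one to succeed.
// Each task returns a bool instead of void so failures are observable.
bool run_phase2(const std::vector<std::string>& nodes, const CommitRpc& rpc) {
    std::vector<std::future<bool>> futures;
    for (const auto& node : nodes) {
        // `node` captured by value; `rpc` outlives the futures because
        // we join all of them below before returning.
        futures.push_back(std::async(std::launch::async,
                                     [node, &rpc] { return rpc(node); }));
    }
    bool all_ok = true;
    for (auto& f : futures) all_ok = f.get() && all_ok;  // join everything
    return all_ok;
}
```

The caller can then translate a false return into `res.set_error(...)` instead of reporting success on a partially committed transaction.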
In `@src/distributed/raft_node.cpp`:
- Around line 47-53: The RpcServer handlers installed via
rpc_server_.set_handler for RpcType::RequestVote and RpcType::AppendEntries (and
the other handlers at 55-61) capture this and are not removed on RaftNode
shutdown; update RaftNode::stop() to unregister or replace those handlers with
non-capturing no-op handlers (or null/empty callbacks) so RpcServer will not
call into a destroyed RaftNode; specifically remove or overwrite handlers set by
rpc_server_.set_handler(...) for RequestVote and AppendEntries (and the other
listed RpcType handlers) in stop(), and ensure any replacement handlers do not
capture this and safely ignore incoming RPCs.
- Around line 196-225: The mutex_ is held across blocking send() calls which can
stall Raft; instead, build and persist all state changes while holding mutex_
(updates to persistent_state_, last_heartbeat_, reply.vote_granted, calls to
persist_state() and step_down(term)), then copy the reply into independent
buffers (e.g. std::vector<uint8_t> and header char[8]) while still holding the
lock or immediately after making a local copy, release mutex_, and perform
send(client_fd, ...) only after unlocking. Apply the same pattern to the other
handler (lines ~237-262): do state mutations under mutex_, prepare/send buffers
after unlocking to avoid holding mutex_ during network I/O.
- Around line 190-195: Before constructing candidate_id from payload, validate
payload length against the parsed id_len: after reading term and id_len
(variables term and id_len using payload), check that payload.size() >= 16 +
id_len and handle the error case (e.g., return an error, throw, or log and
ignore the packet) to avoid out-of-bounds reads; update the code around the
candidate_id construction (where candidate_id is created from payload.data() +
16) to perform this bounds check and use the safe path when the check fails.
- Around line 82-87: The predicate passed to cv_.wait_for re-calls
get_random_timeout(), re-randomizing the threshold; capture the previously
computed timeout and use it in the lambda instead. Update the cv_.wait_for call
so the lambda captures the local timeout (e.g., [this, timeout] or [timeout,
this]) and compares (std::chrono::system_clock::now() - last_heartbeat_ >
timeout) rather than calling get_random_timeout() again; keep checks for
!running_ and last_heartbeat_ unchanged.
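The fix can be sketched like this, assuming a simplified election loop; `get_random_timeout` and the parameter names mirror the description above but are illustrative, not the project's exact code:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <random>

// Randomized election timeout, rolled once per wait.
std::chrono::milliseconds get_random_timeout() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(150, 300);
    return std::chrono::milliseconds(dist(rng));
}

// Compute the timeout ONCE and reuse that same value inside the
// predicate, instead of re-rolling it on every spurious wakeup.
bool wait_for_heartbeat_timeout(std::condition_variable& cv, std::mutex& m,
                                std::chrono::steady_clock::time_point& last_heartbeat,
                                bool& running) {
    const auto timeout = get_random_timeout();  // fixed for this wait
    std::unique_lock<std::mutex> lock(m);
    cv.wait_for(lock, timeout, [&, timeout] {
        return !running ||
               std::chrono::steady_clock::now() - last_heartbeat > timeout;
    });
    return std::chrono::steady_clock::now() - last_heartbeat > timeout;
}
```

With a re-rolled threshold, a wakeup could compare against a different (possibly larger) timeout than the one the wait was armed with, making elections erratic.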
In `@src/main.cpp`:
- Around line 149-157: The command-line parser currently maps any unknown --mode
value to cloudsql::config::RunMode::Standalone which hides typos; change the
else branch in the parsing block that reads mode_str to validate allowed values
("coordinator" or "distributed" -> cloudsql::config::RunMode::Coordinator,
"data" -> cloudsql::config::RunMode::Data, and optionally accept "standalone" ->
cloudsql::config::RunMode::Standalone) and if mode_str is none of those, print a
clear error mentioning permitted values (use the same variable names like
mode_str and config.mode/RunMode) and terminate with a non-zero exit (e.g.,
return 1 or std::exit(1)) so invalid modes are rejected instead of silently
defaulting.
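A minimal sketch of strict mode parsing; `parse_mode` is a hypothetical helper extracted for illustration, while the real code inlines this in main's argument loop:

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class RunMode { Standalone, Coordinator, Data };

// Map --mode values explicitly; anything unrecognized yields nullopt so
// the caller can print an error and exit instead of silently defaulting.
std::optional<RunMode> parse_mode(const std::string& mode_str) {
    if (mode_str == "standalone") return RunMode::Standalone;
    if (mode_str == "coordinator" || mode_str == "distributed")
        return RunMode::Coordinator;
    if (mode_str == "data") return RunMode::Data;
    return std::nullopt;  // a typo such as "cordinator" is rejected
}
```

At the call site, a nullopt result prints the permitted values and returns a non-zero exit code.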
- Around line 270-351: The 2PC RPC handlers currently assume success even when
the transaction doesn't exist; update the TxnPrepare, TxnCommit, and TxnAbort
handlers to validate the transaction returned by
transaction_manager.get_transaction(args.txn_id) and return failure if it's null
(set reply.success = false and a descriptive reply.error_msg). In the TxnPrepare
handler (RpcType::TxnPrepare) also perform the real prepare operation on the
found txn (e.g., call transaction_manager.prepare(txn) or the equivalent prepare
method) before setting reply.success = true, and catch/propagate errors from
that prepare call; for TxnCommit and TxnAbort only call
transaction_manager.commit(txn) / transaction_manager.abort(txn) when txn is
non-null and otherwise set failure with an error message.
- Around line 259-267: The RPC reply writes are unsafe: fix the code paths that
build and send responses (where reply.serialize(), cloudsql::network::RpcHeader
resp_h.encode(...), and send(fd, ...) are used) by (1) validating resp_p.size()
fits into uint16_t before assigning resp_h.payload_len and return/log an error
if too large to avoid silent truncation, (2) performing looped writes that check
send(...) return values and retry until the full header and payload are written
(handle partial writes), and (3) protect against SIGPIPE by using MSG_NOSIGNAL
on send or installing a SIGPIPE handler and treating send errors (including
EPIPE/EINTR) by aborting the send, closing the socket, and logging the errno;
apply these changes for every place where resp_h/payload and send are used.
In `@src/network/rpc_client.cpp`:
- Line 86: The assignment header.payload_len =
static_cast<uint16_t>(payload.size()) can silently truncate payloads > 65535;
before narrowing, check payload.size() <= std::numeric_limits<uint16_t>::max()
(or UINT16_MAX) and reject/handle oversized payloads (e.g., log error and
return/fail the RPC send) instead of casting. Update the code around the
header.payload_len assignment in rpc_client.cpp to perform this validation and
use the existing error/reporting mechanism (or throw) if the payload is too
large so you never store a truncated value in header.payload_len.
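The guard can be sketched as a small helper; `encode_payload_len` is a hypothetical name, as the real code performs this check inline before filling the header:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Validate before narrowing: a payload larger than UINT16_MAX must be
// rejected, never truncated into the 16-bit length field.
bool encode_payload_len(const std::vector<uint8_t>& payload, uint16_t& out_len) {
    if (payload.size() > std::numeric_limits<uint16_t>::max()) {
        return false;  // caller fails the RPC instead of sending garbage
    }
    out_len = static_cast<uint16_t>(payload.size());
    return true;
}
```

A truncated length field is worse than an outright failure: the peer would decode a frame boundary in the middle of the payload and misinterpret everything after it.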
In `@src/network/rpc_server.cpp`:
- Around line 52-67: RpcServer::stop currently sets running_ = false and joins
accept_thread_ and each thread in worker_threads_ but worker threads in
handle_client remain blocked in recv(); to fix, ensure each client socket is
shutdown/closed before joining threads: track/iterate over client FDs (or store
connection wrappers associated with worker_threads_), call shutdown(fd,
SHUT_RDWR) and close(fd) (or socket::close equivalent) to unblock recv in
handle_client, then join accept_thread_ and worker_threads_, and finally clear
worker_threads_ and reset listen_fd_; update RpcServer::stop and ensure
handle_client checks running_ after recv returns to exit cleanly.
In `@tests/distributed_tests.cpp`:
- Around line 45-47: The test calls exec.execute(*stmt, "CREATE TABLE test (id
INT)") and only checks res.success(), which allows a silent no-op; update the
DDLRouting test to actually verify the side effect by asserting the table exists
after execution (e.g., invoke the appropriate catalog/table-existence check or
run a simple metadata/query like exec.execute("SELECT ... from
sqlite_master/...") and EXPECT_TRUE on that result). Locate the CREATE TABLE
invocation around exec.execute, stmt and res, then add a follow-up assertion
that confirms the "test" table is present (use the existing catalog API or exec
helper used elsewhere in tests).
In `@tests/server_tests.cpp`:
- Around line 102-118: The test currently hides connection failures and assumes
recv returns full frames; change the connect check to a hard test (use
ASSERT_EQ/ASSERT_TRUE on connect(...) == 0) so the test fails immediately if the
socket can't connect, and replace the direct recv calls with a read-exact loop
(or helper like read_exact) that repeatedly calls recv until the expected byte
count is received or an error/EOF occurs; apply this for both the 9-byte Auth
response and the 6-byte ReadyForQuery reads, and validate return values and
contents (check buffer[0] == 'R' and buffer[0] == 'Z') after the full frame is
read. Ensure you reference the existing variables/functions: connect, sock,
addr, startup/STARTUP_PKT_LEN, recv, buffer.
---
Outside diff comments:
In `@tests/server_tests.cpp`:
- Around line 1-124: Run clang-format -i on the test file that contains the
ServerTests (tests with TEST(ServerTests, StatusStrings), TEST(ServerTests,
Lifecycle) and TEST(ServerTests, Handshake)) to fix the formatting violations
reported by CI, then stage and commit the reformatted file so the formatting
gate passes; no code changes besides formatting should be made.
---
Minor comments:
In `@src/network/rpc_client.cpp`:
- Around line 1-103: The file fails clang-format; reformat the RpcClient
implementation to match the project's formatting rules (e.g., run clang-format
with the repo's config) so indentation, spacing, and brace placement are correct
across the RpcClient class and its methods (RpcClient::connect,
RpcClient::~RpcClient, RpcClient::disconnect, RpcClient::call,
RpcClient::send_only), ensuring includes, namespace closing, and function
signatures follow the configured style.
In `@src/network/rpc_server.cpp`:
- Around line 1-119: This file fails clang-format; reformat the translation unit
(rpc_server.cpp) to match project style by running clang-format (or your
project's format command) and committing the updated file so CI passes—apply
formatting across the entire file including functions RpcServer::start,
RpcServer::stop, RpcServer::set_handler, RpcServer::accept_loop and
RpcServer::handle_client, then re-run CI to ensure the clang-format check
succeeds.
In `@src/network/server.cpp`:
- Around line 95-107: Run clang-format -i on src/network/server.cpp to fix the
CI formatting failures; reformat the file so the constructor Server::Server(...)
initializer list and the factory Server::create(...) follow the project's
clang-format rules (ensure lines around Server::Server, Server::create, and the
regions mentioned at lines ~122, ~261, ~268 are reformatted). Commit the
reformatted file so CI formatting checks pass.
In `@tests/distributed_tests.cpp`:
- Around line 43-46: The test dereferences the result of
parser.parse_statement() without checking for null; add a guard asserting the
pointer is non-null before dereference by inserting ASSERT_NE(stmt, nullptr)
after auto stmt = parser.parse_statement(), so that the subsequent call
exec.execute(*stmt, "CREATE TABLE test (id INT)") is safe and the test fails
clearly if parsing returns null.
In `@tests/distributed_txn_tests.cpp`:
- Around line 36-39: The tests dereference the parsed statement pointer (*stmt)
without asserting parsing succeeded; add a null-check assertion (e.g.,
ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr)) immediately after calling
parser.parse_statement() before using *stmt in calls like exec.execute(*stmt,
"COMMIT") (apply the same fix for the other occurrences around the blocks with
parser.parse_statement() at the noted locations).
---
Duplicate comments:
In `@include/distributed/shard_manager.hpp`:
- Around line 26-33: compute_shard currently uses std::hash<std::string> which
is non-deterministic across builds; replace it with a deterministic FNV-1a
64-bit hash implementation inside compute_shard (use the bytes of
pk_value.to_string() or pk_value.data() if available), compute the 64-bit FNV-1a
hash, then take hash % num_shards and return as uint32_t; ensure the function
still handles num_shards==0 and casts the 64-bit result safely to uint32_t when
returning.
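A sketch of the deterministic hash this prompt asks for; the constants are the standard 64-bit FNV-1a offset basis and prime:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Deterministic FNV-1a 64-bit hash: identical on every build and host,
// unlike std::hash<std::string>, so shard placement stays stable.
uint64_t fnv1a_64(const std::string& bytes) {
    uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        hash ^= c;
        hash *= 1099511628211ULL;             // FNV prime
    }
    return hash;
}

uint32_t compute_shard(const std::string& pk_bytes, uint32_t num_shards) {
    if (num_shards == 0) return 0;            // guard the modulo
    return static_cast<uint32_t>(fnv1a_64(pk_bytes) % num_shards);
}
```

Because the hash is pinned to fixed constants rather than an implementation-defined std::hash, rows route to the same shard regardless of compiler, standard library, or ASLR-style hash seeding.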
In `@include/network/rpc_message.hpp`:
- Around line 120-135: In QueryResultsReply::deserialize, avoid the OOB read and
non-portable integer parsing by first verifying in.size() >= 5 before copying
err_len and then verifying in.size() >= 5 + err_len + 4 before reading
row_count; replace the raw memcpy interpretation of err_len and row_count with
network-byte-order-aware conversion (e.g., read uint32_t from the byte stream
then apply ntohl/ntohl-equivalent) so values are normalized across hosts; update
the checks around reply.error_msg and reply.rows resizing to use these
validated, converted lengths and bail out or throw if sizes would overflow or
exceed in.size() bounds.
In `@include/network/rpc_server.hpp`:
- Around line 54-65: Add explicit client-lifecycle shutdown state and
coordination to allow stop() to unblock threads in recv(): introduce a
thread-safe container for active client fds (e.g., std::unordered_set<int>
client_fds_ with std::mutex client_fds_mutex_), an atomic<bool>
shutting_down_{false} (or similar) and a std::condition_variable shutdown_cv_ to
signal shutdown. Update accept_loop() to stop accepting when shutting_down_ is
set; register/deregister fds in handle_client() using client_fds_ and
periodically check shutting_down_ (or rely on shutdown_cv_) so the loop can
exit; implement stop() to set shutting_down_, lock client_fds_mutex_, close all
tracked fds to unblock recv(), notify shutdown_cv_, and join worker/accept
threads. Ensure handlers_ synchronization remains unchanged.
In `@src/distributed/distributed_executor.cpp`:
- Around line 60-73: Loop-captured variable `node` is captured by reference in
the std::async lambdas (e.g., the lambda pushed into prepare_futures), which is
unsafe; change the lambda captures to capture node by value (e.g., replace
[&node, payload] with [node, payload]) for the prepare_futures async call and do
the same for the other async lambdas referenced (the similar std::async calls
around lines 92-99 and 117-125) so each future holds its own copy of node
instead of referencing a loop variable.
In `@src/network/rpc_client.cpp`:
- Around line 58-73: RpcClient::call currently assumes single send()/recv()
complete transfers and reads the response outside any synchronization, which can
cause partial reads/writes and interleaved responses across concurrent calls;
fix by (1) implementing and using robust readn/writeN helpers (e.g.,
readFull/writeFull) that loop until the requested byte count is transferred or
an error/EOF occurs, and replace the direct recv(fd_, ...) calls in call (and
the similar block at the other occurrence) with readFull to fully read the
8-byte RpcHeader and then the full payload; and (2) serialize per-call socket
use by protecting the entire send+receive sequence in RpcClient::call (including
the send_only invocation or replace it with writeFull) with a mutex (e.g., a
member std::mutex) so concurrent calls cannot interleave on fd_. Ensure errors
from partial transfers propagate as failures from call().
- Around line 79-81: send_only() currently holds std::scoped_lock<std::mutex>
lock(mutex_) and then calls connect(), but connect() also locks mutex_, causing
a self-deadlock; fix by checking fd_ under the scoped_lock, then release the
lock before invoking connect() (e.g., drop the scoped_lock or limit its scope)
and after connect() re-acquire the mutex to re-check fd_/state as needed;
reference the send_only() function, the connect() method, the mutex_ member and
fd_ variable when making this change.
In `@src/network/rpc_server.cpp`:
- Around line 93-102: The recv calls reading the RPC frame are assuming full
reads; replace them with a robust "recv_all" style loop to ensure exactly 8
bytes are read into header_buf and then exactly header.payload_len bytes into
payload (or return/error if peer closes/short read). Update the code paths
around header_buf, RpcHeader::decode(header_buf), payload and client_fd to
call/use recv_all(client_fd, header_buf, 8) and recv_all(client_fd,
payload.data(), header.payload_len) (or equivalent helper that returns bytes
read or an error) and handle partial reads/EOF by breaking/closing the
connection before attempting to decode or process a truncated frame.
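A recv_all helper along these lines makes the framing robust; this is a sketch against plain POSIX sockets, not the project's exact signatures:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

// Loop until exactly `len` bytes arrive; returns false on error or if
// the peer closes early, so a truncated frame is never decoded.
bool recv_all(int fd, void* buf, size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) return false;  // 0 = orderly shutdown, <0 = error
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}
```

The server would call this once for the 8-byte header and once for the payload, closing the connection on any false return instead of decoding a partial frame. A symmetric send_all loop covers partial writes on the reply path.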
---
Nitpick comments:
In `@include/distributed/distributed_executor.hpp`:
- Around line 1-5: The file-level doc comment at the top of
distributed_executor.hpp contains a duplicated opener ("/*" or "/**") on the
second line; remove the extra "/**" so the header comment is a single
well-formed block comment describing the file (leaving the existing lines with
"@file distributed_executor.hpp" and "@brief High-level executor for distributed
queries" intact).
In `@include/distributed/shard_manager.hpp`:
- Around line 9-12: The header is not self-contained: it uses std::optional and
std::hash but only includes <string>, <vector>, <cstdint>; add direct includes
for <optional> and <functional> at the top of
include/distributed/shard_manager.hpp so symbols like std::optional and
std::hash resolve without relying on transitive includes (ensure the new
includes appear alongside the existing `#include` lines).
In `@tests/distributed_txn_tests.cpp`:
- Around line 107-109: Remove the fixed std::this_thread::sleep_for(...) calls
and rely on execute() to block until async phases complete; specifically delete
the sleep before the EXPECT_EQ(commit_count.load(), 2) check (and the analogous
sleep at the other occurrence) so the test asserts immediately after
txn.execute() or, if execute() cannot be relied on, replace the sleep with a
proper wait on commit_count using a small-timeout loop or condition_variable
tied to commit_count rather than a fixed sleep. Ensure references to
commit_count and execute() remain the synchronization points in the test.
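One bounded-wait replacement for the fixed sleep; `wait_for_count` is a hypothetical test helper, not part of the project's test utilities:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Poll with a deadline: returns as soon as the counter reaches the
// target, and only fails after the full timeout has elapsed.
bool wait_for_count(const std::atomic<int>& counter, int target,
                    std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load() < target) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

Unlike a fixed sleep, this never slows the fast path (it returns immediately when the count is already met) and only adds latency in the failure case.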
In `@tests/raft_simulation_tests.cpp`:
- Around line 33-38: The test FollowerToCandidate currently only sleeps and
never asserts the node's post-election state; add a concrete expectation after
the sleep that verifies the node transitioned (e.g., assert node->getState() ==
State::Candidate || node->getState() == State::Leader). If no accessor exists,
expose the state for tests by adding a read-only accessor like
RaftNode::getState() (or make the state member test-visible) and use that in the
test assertion to fail when the transition did not occur.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- include/catalog/catalog.hpp
- include/common/config.hpp
- include/distributed/distributed_executor.hpp
- include/distributed/raft_node.hpp
- include/distributed/shard_manager.hpp
- include/network/rpc_client.hpp
- include/network/rpc_message.hpp
- include/network/rpc_server.hpp
- src/catalog/catalog.cpp
- src/distributed/distributed_executor.cpp
- src/distributed/raft_node.cpp
- src/main.cpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- src/storage/storage_manager.cpp
- tests/distributed_tests.cpp
- tests/distributed_txn_tests.cpp
- tests/raft_simulation_tests.cpp
- tests/server_tests.cpp
✅ Files skipped from review due to trivial changes (1)
- src/storage/storage_manager.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- include/catalog/catalog.hpp
```cpp
bool connect();
void disconnect();
bool is_connected() const { return fd_ >= 0; }
```
🧩 Analysis chain — scripts executed against the repository (poyrazK/cloudSQL):
```shell
# First, find the RPC client files
find . -type f -name "*rpc_client*" | head -20
# Look at the header file to understand the structure
cat -n include/network/rpc_client.hpp
# Find the implementation file
find . -type f -name "*.cpp" -o -name "*.cc" | xargs grep -l "rpc_client\|RpcClient" | head -10
# Now look at the implementation to see how fd_ is accessed
cat -n src/network/rpc_client.cpp
```
Guard is_connected() with the mutex to avoid a data race.
Line 27 reads fd_ without synchronization, while connect() (line 25) and disconnect() (line 50) write fd_ under mutex_. Make is_connected() lock mutex_ (or make fd_ atomic) to avoid undefined behavior under concurrent access.
Suggested fix
```diff
- bool is_connected() const { return fd_ >= 0; }
+ bool is_connected() const {
+     const std::scoped_lock<std::mutex> lock(mutex_);
+     return fd_ >= 0;
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_client.hpp` at line 27, The is_connected() accessor reads
fd_ without synchronization causing a data race against connect() and
disconnect() which hold mutex_; fix by synchronizing access: either acquire the
same mutex_ inside is_connected() (lock mutex_ and return fd_ >= 0) or change
fd_ to std::atomic<int> and update connect()/disconnect() accordingly; reference
is_connected(), connect(), disconnect(), fd_, and mutex_ to locate the changes
and ensure all reads/writes use the chosen synchronization method.
```cpp
[[nodiscard]] std::vector<uint8_t> serialize() const {
    std::vector<uint8_t> out(8);
    std::memcpy(out.data(), &txn_id, 8);
    return out;
}

static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) {
    TxnOperationArgs args;
    if (in.size() >= 8) {
        std::memcpy(&args.txn_id, in.data(), 8);
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the specific file and lines mentioned
cat -n include/network/rpc_message.hpp | head -170 | tail -30Repository: poyrazK/cloudSQL
Length of output: 1024
🏁 Script executed:
# Search for endianness-related patterns and functions in the codebase
rg -i "endian|hton|ntoh|byte.*order" --type-list | head -20Repository: poyrazK/cloudSQL
Length of output: 449
🏁 Script executed:
# Look for the definition of txn_id and TxnOperationArgs
rg "txn_id" include/network/rpc_message.hpp -B 5 -A 5Repository: poyrazK/cloudSQL
Length of output: 579
🏁 Script executed:
# Search for other serialization methods to see if they handle endianness
rg "serialize|deserialize" include/network/rpc_message.hpp -B 2 -A 10Repository: poyrazK/cloudSQL
Length of output: 2617
🏁 Script executed:
# Search for endianness handling utilities in the codebase
rg "hton|ntoh|endian|byte.*order" -iRepository: poyrazK/cloudSQL
Length of output: 2103
🏁 Script executed:
# Check how TxnOperationArgs is used (network context)
rg "TxnOperationArgs" -B 3 -A 3

Repository: poyrazK/cloudSQL
Length of output: 3518
🏁 Script executed:
# Look for any network or protocol specifications
fd -e md -e txt | head -20

Repository: poyrazK/cloudSQL
Length of output: 134
🏁 Script executed:
# Check if there are any comments about distributed/multi-arch scenarios
rg "distributed|architecture|endian|cluster" -i include/network/

Repository: poyrazK/cloudSQL
Length of output: 673
Normalize txn_id endianness in network payloads.
TxnOperationArgs currently copies host-endian uint64_t directly without byte order conversion. Use htonll() on serialize and ntohll() on deserialize to ensure transaction IDs remain consistent across mixed-architecture clusters. This same issue affects other numeric fields like QueryResultsReply.err_len and row_count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 148 - 157, The code serializes
txn_id in TxnOperationArgs by memcpy’ing host-endian uint64_t directly; update
TxnOperationArgs::serialize to convert txn_id to network byte order with
htonll() before writing into the output buffer and update
TxnOperationArgs::deserialize to read the 8 bytes then convert back with
ntohll() into args.txn_id; apply the same pattern to other numeric fields
mentioned (e.g., QueryResultsReply::err_len and row_count) so all multi-byte
integer fields use htonll/ntohll (or appropriate hton*/ntoh* for sizes) when
serializing/deserializing.
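Since htonll()/ntohll() aren't standard POSIX, one portable way to get the 64-bit byte-order conversion the comment asks for is a shift-based helper (the names below are illustrative, not existing project functions):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative helpers: write/read a uint64_t in big-endian (network) byte
// order regardless of host endianness.
inline uint64_t htonll_portable(uint64_t v) {
    uint8_t bytes[8];
    for (int i = 0; i < 8; ++i) {
        bytes[i] = static_cast<uint8_t>(v >> (56 - 8 * i));  // MSB first
    }
    uint64_t wire = 0;
    std::memcpy(&wire, bytes, 8);
    return wire;
}

inline uint64_t ntohll_portable(uint64_t wire) {
    uint8_t bytes[8];
    std::memcpy(bytes, &wire, 8);
    uint64_t v = 0;
    for (uint8_t b : bytes) {
        v = (v << 8) | b;  // fold bytes back, MSB first
    }
    return v;
}
```

TxnOperationArgs::serialize() would then memcpy the result of htonll_portable(txn_id), and deserialize() would apply ntohll_portable after its memcpy; the same pattern covers the other multi-byte fields.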
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) {
    if (raft_node_ != nullptr) {
        /* TODO: Serialize DDL and replicate via Raft */
        /* For now, just call local to keep it working during Step 4 implementation */
        return create_table_local(table_name, std::move(columns));
    }
    return create_table_local(table_name, std::move(columns));
}
Raft-enabled path still performs local DDL without replication.
When raft_node_ is set, returning local success here can split catalog state across nodes.
🚧 Safer interim behavior
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) {
if (raft_node_ != nullptr) {
- /* TODO: Serialize DDL and replicate via Raft */
- /* For now, just call local to keep it working during Step 4 implementation */
- return create_table_local(table_name, std::move(columns));
+ throw std::runtime_error("Raft-replicated CREATE TABLE is not implemented yet");
}
return create_table_local(table_name, std::move(columns));
}

Also applies to: 111-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/catalog/catalog.cpp` around lines 75 - 82, When raft_node_ is set the
code currently calls create_table_local and returns local success without
replicating via Raft; change Catalog::create_table so that if raft_node_ !=
nullptr it does not perform local-only DDL but instead returns/throws an
explicit "not supported in raft mode" error (or logs and aborts the operation)
until the Raft serialization/replication is implemented, rather than calling
create_table_local; apply the same change pattern to the other DDL paths that
currently bypass replication (e.g., any other methods that check raft_node_ and
call create_*_local or drop_*_local).
/* Basic Shard Assignment */
ShardInfo shard;
shard.shard_id = 0;
shard.node_address = "127.0.0.1"; // Default
shard.port = 6432;
table->shards.push_back(shard);
Default shard assignment is hard-coded to localhost.
All tables are pinned to 127.0.0.1:6432, which is unsafe for real cluster routing and failover behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/catalog/catalog.cpp` around lines 95 - 100, The code currently hard-codes
a default shard to localhost by creating a ShardInfo with shard.shard_id = 0,
shard.node_address = "127.0.0.1", shard.port = 6432 and pushing it into
table->shards; replace this static assignment with a configurable or
discovery-backed default: read the shard address/port from configuration or
environment (or call an existing helper like
getLocalAddress()/getClusterConfig() if available) and populate ShardInfo from
those values, falling back to a safe sentinel (e.g., empty node_address or
INADDR_ANY/0) rather than "127.0.0.1"; keep the same symbols (ShardInfo,
shard.shard_id, shard.node_address, shard.port, table->shards.push_back) so
changes are localized.
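One way to make the default shard configurable, roughly per the prompt above (the CLOUDSQL_SHARD_ADDR/CLOUDSQL_SHARD_PORT variable names and the standalone ShardInfo here are assumptions for illustration, not the repo's actual config surface):

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Standalone stand-in for the catalog's ShardInfo.
struct ShardInfo {
    uint32_t shard_id = 0;
    std::string node_address;  // empty string = "unassigned" sentinel
    uint16_t port = 0;
};

// Read the default shard endpoint from the environment instead of pinning
// every table to 127.0.0.1:6432; fall back to the empty sentinel.
inline ShardInfo default_shard_from_env() {
    ShardInfo shard;
    if (const char* addr = std::getenv("CLOUDSQL_SHARD_ADDR")) {
        shard.node_address = addr;
    }
    if (const char* port = std::getenv("CLOUDSQL_SHARD_PORT")) {
        shard.port = static_cast<uint16_t>(std::strtoul(port, nullptr, 10));
    }
    return shard;
}
```

An empty node_address then fails fast at routing time instead of silently sending traffic to localhost.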
// For simplicity, we assume a single active global transaction ID.
constexpr uint64_t GLOBAL_TXN_ID = 1;

if (type == parser::StmtType::TransactionCommit) {
    std::string errors;

    network::TxnOperationArgs args;
    args.txn_id = GLOBAL_TXN_ID;
    auto payload = args.serialize();
Fixed GLOBAL_TXN_ID = 1 is unsafe for concurrent transactions.
All commits/rollbacks are multiplexed onto one ID, which can cross-wire independent transactions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/distributed/distributed_executor.cpp` around lines 47 - 55, The code
incorrectly hardcodes GLOBAL_TXN_ID = 1 causing cross-wiring of independent
transactions; instead remove the constexpr and get the real transaction id for
this operation (e.g., from the execution context or transaction manager) and
assign it to network::TxnOperationArgs::txn_id before serialize()—for example,
obtain currentTxnId via your TransactionManager/ExecutionContext API (or call a
TransactionIdGenerator/BeginTxn helper) and use that id where
parser::StmtType::TransactionCommit is handled so each commit/rollback uses the
correct unique txn id rather than GLOBAL_TXN_ID.
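Until the executor threads a real transaction context through, even a minimal per-commit id source avoids cross-wiring; a sketch (TxnIdGenerator is hypothetical, not an existing class in the repo):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical id source: each distributed commit/rollback gets its own id
// instead of the shared GLOBAL_TXN_ID = 1.
class TxnIdGenerator {
public:
    uint64_t next() {
        // fetch_add returns the pre-increment value, so ids are 1, 2, 3, ...
        return next_id_.fetch_add(1, std::memory_order_relaxed);
    }
private:
    std::atomic<uint64_t> next_id_{1};
};
```

The proper fix remains using the id the TransactionManager already assigned for the session, so prepare/commit/abort all carry the same value end to end.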
rpc_server->set_handler(
    cloudsql::network::RpcType::TxnPrepare,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        // In a full implementation, we'd find the txn by ID and flush its WAL.
        // For now, we just force a flush.
        log_manager->flush(true);
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnCommit,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
        if (txn) {
          transaction_manager.commit(txn);
        }
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnAbort,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
        if (txn) {
          transaction_manager.abort(txn);
        }
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });
2PC handlers currently acknowledge success for unknown transactions.
Line 280 in TxnPrepare is a WAL flush-only success path and does not validate/prepare the transaction.
Lines 305-337 in TxnCommit/TxnAbort also return success when get_transaction(args.txn_id) is null. This breaks 2PC guarantees.
✅ Suggested fix
rpc_server->set_handler(
cloudsql::network::RpcType::TxnPrepare,
[&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
int fd) {
(void)h;
auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
cloudsql::network::QueryResultsReply reply;
try {
- // In a full implementation, we'd find the txn by ID and flush its WAL.
- // For now, we just force a flush.
- log_manager->flush(true);
- reply.success = true;
+ auto txn = transaction_manager.get_transaction(args.txn_id);
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.prepare(txn);
+ log_manager->flush(true);
+ reply.success = true;
+ }
} catch (const std::exception& e) {
reply.success = false;
reply.error_msg = e.what();
}
@@
try {
auto txn = transaction_manager.get_transaction(args.txn_id);
- if (txn) {
- transaction_manager.commit(txn);
- }
- reply.success = true;
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.commit(txn);
+ reply.success = true;
+ }
} catch (const std::exception& e) {
@@
try {
auto txn = transaction_manager.get_transaction(args.txn_id);
- if (txn) {
- transaction_manager.abort(txn);
- }
- reply.success = true;
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.abort(txn);
+ reply.success = true;
+ }
      } catch (const std::exception& e) {

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main.cpp` around lines 270 - 351, The 2PC RPC handlers currently assume
success even when the transaction doesn't exist; update the TxnPrepare,
TxnCommit, and TxnAbort handlers to validate the transaction returned by
transaction_manager.get_transaction(args.txn_id) and return failure if it's null
(set reply.success = false and a descriptive reply.error_msg). In the TxnPrepare
handler (RpcType::TxnPrepare) also perform the real prepare operation on the
found txn (e.g., call transaction_manager.prepare(txn) or the equivalent prepare
method) before setting reply.success = true, and catch/propagate errors from
that prepare call; for TxnCommit and TxnAbort only call
transaction_manager.commit(txn) / transaction_manager.abort(txn) when txn is
non-null and otherwise set failure with an error message.
RpcHeader header;
header.type = type;
header.payload_len = static_cast<uint16_t>(payload.size());
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and get an overview
fd "rpc_client.cpp" -t f

Repository: poyrazK/cloudSQL
Length of output: 86
🏁 Script executed:
# Read the relevant section of the file with context around line 86
wc -l src/network/rpc_client.cpp

Repository: poyrazK/cloudSQL
Length of output: 90
🏁 Script executed:
# Get the context around line 86
sed -n '70,110p' src/network/rpc_client.cpp

Repository: poyrazK/cloudSQL
Length of output: 885
🏁 Script executed:
# Get more context - find the function that contains line 86
sed -n '1,100p' src/network/rpc_client.cpp | head -100

Repository: poyrazK/cloudSQL
Length of output: 2397
🏁 Script executed:
# Find the header file with RpcHeader definition
fd "rpc" -t f | head -20

Repository: poyrazK/cloudSQL
Length of output: 207
🏁 Script executed:
# Search for RpcHeader definition
rg "struct RpcHeader|class RpcHeader" --type cpp

Repository: poyrazK/cloudSQL
Length of output: 110
🏁 Script executed:
# Look for the header file
fd "rpc_client.hpp" -t f

Repository: poyrazK/cloudSQL
Length of output: 90
🏁 Script executed:
# Read the header file to understand RpcHeader
cat src/network/rpc_client.hpp

Repository: poyrazK/cloudSQL
Length of output: 118
🏁 Script executed:
# Read the RpcHeader definition
cat include/network/rpc_message.hpp

Repository: poyrazK/cloudSQL
Length of output: 4954
🏁 Script executed:
# Read the client header
cat include/network/rpc_client.hpp

Repository: poyrazK/cloudSQL
Length of output: 1144
Validate payload length before narrowing to uint16_t.
The direct cast of payload.size() to uint16_t on line 86 silently truncates payloads exceeding 65535 bytes, corrupting RPC message framing. Add validation to reject oversized payloads:
Suggested guard
+ if (payload.size() > std::numeric_limits<uint16_t>::max()) {
+ return false;
+ }
  header.payload_len = static_cast<uint16_t>(payload.size());

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (payload.size() > std::numeric_limits<uint16_t>::max()) {
    return false;
}
header.payload_len = static_cast<uint16_t>(payload.size());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` at line 86, The assignment header.payload_len =
static_cast<uint16_t>(payload.size()) can silently truncate payloads > 65535;
before narrowing, check payload.size() <= std::numeric_limits<uint16_t>::max()
(or UINT16_MAX) and reject/handle oversized payloads (e.g., log error and
return/fail the RPC send) instead of casting. Update the code around the
header.payload_len assignment in rpc_client.cpp to perform this validation and
use the existing error/reporting mechanism (or throw) if the payload is too
large so you never store a truncated value in header.payload_len.
void RpcServer::stop() {
    running_ = false;
    if (listen_fd_ >= 0) {
        static_cast<void>(close(listen_fd_));
        listen_fd_ = -1;
    }
    if (accept_thread_.joinable()) {
        accept_thread_.join();
    }
    for (auto& t : worker_threads_) {
        if (t.joinable()) {
            t.join();
        }
    }
    worker_threads_.clear();
}
🧩 Analysis chain
🏁 Script executed:
# Locate and examine the RpcServer implementation
find . -name "rpc_server.cpp" -o -name "rpc_server.h" | head -20

Repository: poyrazK/cloudSQL
Length of output: 88
🏁 Script executed:
# Get an overview of the file structure
git ls-files | grep -E "(rpc_server|RpcServer)" | head -20

Repository: poyrazK/cloudSQL
Length of output: 117
🏁 Script executed:
# Check file sizes
wc -l src/network/rpc_server.cpp include/network/rpc_server.hpp

Repository: poyrazK/cloudSQL
Length of output: 138
🏁 Script executed:
# Read the full implementation file
cat -n src/network/rpc_server.cpp

Repository: poyrazK/cloudSQL
Length of output: 3898
🏁 Script executed:
# Read the header file
cat -n include/network/rpc_server.hpp

Repository: poyrazK/cloudSQL
Length of output: 2179
stop() blocks indefinitely on worker thread joins due to blocked recv() calls.
Worker threads in handle_client() may be blocked in recv() calls (lines 93, 101) waiting for client data. When stop() is called, setting running_=false doesn't unblock these threads—they will remain blocked until data arrives or the socket times out. This causes stop() to hang at lines 61-64 waiting for the threads to join. Close or shutdown client sockets before joining worker threads to ensure graceful termination.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_server.cpp` around lines 52 - 67, RpcServer::stop currently
sets running_ = false and joins accept_thread_ and each thread in
worker_threads_ but worker threads in handle_client remain blocked in recv(); to
fix, ensure each client socket is shutdown/closed before joining threads:
track/iterate over client FDs (or store connection wrappers associated with
worker_threads_), call shutdown(fd, SHUT_RDWR) and close(fd) (or socket::close
equivalent) to unblock recv in handle_client, then join accept_thread_ and
worker_threads_, and finally clear worker_threads_ and reset listen_fd_; update
RpcServer::stop and ensure handle_client checks running_ after recv returns to
exit cleanly.
auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
EXPECT_TRUE(res.success());
}
DDLRouting test does not verify the side effect.
This currently passes on a silent no-op. Please assert table existence after execution.
✅ Proposed assertion
auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
EXPECT_TRUE(res.success());
+ auto table = catalog->get_table_by_name("test");
+ EXPECT_TRUE(table.has_value());

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/distributed_tests.cpp` around lines 45 - 47, The test calls
exec.execute(*stmt, "CREATE TABLE test (id INT)") and only checks res.success(),
which allows a silent no-op; update the DDLRouting test to actually verify the
side effect by asserting the table exists after execution (e.g., invoke the
appropriate catalog/table-existence check or run a simple metadata/query like
exec.execute("SELECT ... from sqlite_master/...") and EXPECT_TRUE on that
result). Locate the CREATE TABLE invocation around exec.execute, stmt and res,
then add a follow-up assertion that confirms the "test" table is present (use
the existing catalog API or exec helper used elsewhere in tests).
if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
    // Send startup packet
    const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
                                             htonl(196608)};
    send(sock, startup.data(), startup.size() * 4, 0);

    // Receive Auth OK
    std::array<char, 9> buffer{};
    ssize_t n = recv(sock, buffer.data(), 9, 0);
    EXPECT_EQ(n, 9);
    EXPECT_EQ(buffer[0], 'R');

    // Receive ReadyForQuery
    n = recv(sock, buffer.data(), 6, 0);
    EXPECT_EQ(n, 6);
    EXPECT_EQ(buffer[0], 'Z');
}
Handshake test currently allows false positives and flaky reads.
Line 102 gates all assertions behind if (connect(...) == 0), so the test can pass even when no connection is established. Also, Line 110 and Line 115 assume recv returns full frames in one call, which is not guaranteed for TCP streams.
✅ Suggested fix
- if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
- // Send startup packet
- const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
- htonl(196608)};
- send(sock, startup.data(), startup.size() * 4, 0);
-
- // Receive Auth OK
- std::array<char, 9> buffer{};
- ssize_t n = recv(sock, buffer.data(), 9, 0);
- EXPECT_EQ(n, 9);
- EXPECT_EQ(buffer[0], 'R');
-
- // Receive ReadyForQuery
- n = recv(sock, buffer.data(), 6, 0);
- EXPECT_EQ(n, 6);
- EXPECT_EQ(buffer[0], 'Z');
- }
+ ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0);
+
+ const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
+ htonl(196608)};
+ ASSERT_EQ(send(sock, startup.data(), startup.size() * 4, 0),
+ static_cast<ssize_t>(startup.size() * 4));
+
+ auto recv_exact = [&](char* dst, size_t len) -> bool {
+ size_t got = 0;
+ while (got < len) {
+ ssize_t n = recv(sock, dst + got, len - got, 0);
+ if (n <= 0) return false;
+ got += static_cast<size_t>(n);
+ }
+ return true;
+ };
+
+ std::array<char, 9> buffer{};
+ ASSERT_TRUE(recv_exact(buffer.data(), 9));
+ EXPECT_EQ(buffer[0], 'R');
+
+ ASSERT_TRUE(recv_exact(buffer.data(), 6));
+ EXPECT_EQ(buffer[0], 'Z');

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/server_tests.cpp` around lines 102 - 118, The test currently hides
connection failures and assumes recv returns full frames; change the connect
check to a hard test (use ASSERT_EQ/ASSERT_TRUE on connect(...) == 0) so the
test fails immediately if the socket can't connect, and replace the direct recv
calls with a read-exact loop (or helper like read_exact) that repeatedly calls
recv until the expected byte count is received or an error/EOF occurs; apply
this for both the 9-byte Auth response and the 6-byte ReadyForQuery reads, and
validate return values and contents (check buffer[0] == 'R' and buffer[0] ==
'Z') after the full frame is read. Ensure you reference the existing
variables/functions: connect, sock, addr, startup/STARTUP_PKT_LEN, recv, buffer.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/server_tests.cpp (1)
102-118: ⚠️ Potential issue | 🟠 Major

Handshake test can silently pass without establishing connection.

The if (connect(...) == 0) pattern at Line 102 allows the test to pass even when no connection is made. Additionally, recv() calls at Lines 110 and 115 assume full protocol frames are returned in a single call, which TCP does not guarantee.

Replace if with ASSERT_EQ for the connect call and use a read-exact helper to ensure complete frame reception.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 102 - 118, The test currently uses if (connect(... ) == 0) which can silently skip assertions; replace that conditional with an assertion (e.g., ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)) so the test fails on connect error, and replace the direct recv(...) calls with a small helper (e.g., readExact(sock, void* buf, size_t len)) that loops until exactly the requested number of bytes are read or fails; use that helper for both the 9-byte Auth frame and the 6-byte ReadyForQuery frame (asserting the return equals the expected length) before checking buffer[0] for 'R' and 'Z', and keep use of STARTUP_PKT_LEN and buffer unchanged.

src/network/rpc_client.cpp (1)
96-102: ⚠️ Potential issue | 🟠 Major

Partial send() not handled.

The send() calls at lines 96 and 100 may return fewer bytes than requested on a non-blocking or interrupted socket. Both should loop until all bytes are sent.

🔧 Proposed fix with send_all helper
+namespace {
+ssize_t send_all(int fd, const void* buf, size_t len) {
+  size_t total = 0;
+  const auto* data = static_cast<const char*>(buf);
+  while (total < len) {
+    const ssize_t n = send(fd, data + total, len - total, 0);
+    if (n <= 0) return n;
+    total += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(total);
+}
+}  // namespace

-  if (send(fd_, header_buf, 8, 0) <= 0) {
+  if (send_all(fd_, header_buf, 8) != 8) {
     return false;
   }
   if (!payload.empty()) {
-    if (send(fd_, payload.data(), payload.size(), 0) <= 0) {
+    if (send_all(fd_, payload.data(), payload.size()) !=
+        static_cast<ssize_t>(payload.size())) {
       return false;
     }
   }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 96 - 102, The send() calls can return partial writes; implement a send_all helper (e.g., ssize_t send_all(int fd, const char* buf, size_t len, int flags)) that loops until all bytes are sent or a non-recoverable error occurs (retry on EINTR, handle EAGAIN/EWOULDBLOCK appropriately or propagate failure). Replace the direct calls that use fd_, header_buf, payload.data()/payload.size() with send_all to ensure the full 8-byte header and the full payload are transmitted, returning false on unrecoverable errors and true when all bytes are sent.
🧹 Nitpick comments (3)
tests/server_tests.cpp (3)
65-65: Consider checking socket creation for failure.
socket() returns -1 on failure. While unlikely in tests, checking this would prevent misleading failures.

🛡️ Proposed fix

- int sock = socket(AF_INET, SOCK_STREAM, 0);
+ int sock = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_NE(sock, -1) << "Failed to create socket";

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 65, The socket() call that assigns to variable 'sock' should be checked for failure (socket() returns -1) — after int sock = socket(AF_INET, SOCK_STREAM, 0); add a check for sock == -1 and handle the error (e.g., call perror or log the error and fail the test / return early) to avoid misleading test failures; ensure any allocated resources are cleaned up before exiting the test.
106-106: send() return value is discarded.

The return value of send() should be checked to ensure the full startup packet was transmitted. This helps diagnose failures when the handshake doesn't proceed as expected.

🛡️ Proposed fix

- send(sock, startup.data(), startup.size() * 4, 0);
+ ssize_t sent = send(sock, startup.data(), startup.size() * 4, 0);
+ ASSERT_EQ(sent, static_cast<ssize_t>(startup.size() * 4));

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 106, The call to send(sock, startup.data(), startup.size() * 4, 0) discards its return value so partial sends or errors go unnoticed; replace it with a send-loop that tracks bytes_sent versus total_len (total_len = startup.size() * 4), repeatedly calling send on the buffer pointer (startup.data() + bytes_sent) until bytes_sent == total_len, and on each iteration handle -1 return values by checking errno for EINTR (retry) or other errors (fail the test and log/ASSERT with errno), referencing the existing variables sock and startup and the send call to locate where to add the loop and error handling.
96-96: Check socket creation for failure.

Similar to the Lifecycle test, socket() can return -1 on failure.

🛡️ Proposed fix

  int sock = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_NE(sock, -1) << "Failed to create socket";

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 96, The socket creation call int sock = socket(AF_INET, SOCK_STREAM, 0); does not check for failure; update the test to verify the returned descriptor is valid (i.e., not -1) immediately after socket() (using the test framework's assert/expect) and handle the error path (fail the test or return) so subsequent calls don't use an invalid sock variable; reference the 'sock' variable and the socket(...) call in tests/server_tests.cpp when adding the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/network/rpc_message.hpp`:
- Around line 139-143: The CI failed due to formatting around the end of the RPC
message class; run clang-format -i on the header containing the RpcMessage class
and fix whitespace/brace placement around the method that returns reply (the
function with "return reply;") so the closing braces and semicolon are formatted
per project style, then commit the formatted file.
---
Duplicate comments:
In `@src/network/rpc_client.cpp`:
- Around line 96-102: The send() calls can return partial writes; implement a
send_all helper (e.g., ssize_t send_all(int fd, const char* buf, size_t len, int
flags)) that loops until all bytes are sent or a non-recoverable error occurs
(retry on EINTR, handle EAGAIN/EWOULDBLOCK appropriately or propagate failure).
Replace the direct calls that use fd_, header_buf, payload.data()/payload.size()
with send_all to ensure the full 8-byte header and the full payload are
transmitted, returning false on unrecoverable errors and true when all bytes are
sent.
In `@tests/server_tests.cpp`:
- Around line 102-118: The test currently uses if (connect(... ) == 0) which can
silently skip assertions; replace that conditional with an assertion (e.g.,
ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr),
sizeof(addr)), 0)) so the test fails on connect error, and replace the direct
recv(...) calls with a small helper (e.g., readExact(sock, void* buf, size_t
len)) that loops until exactly the requested number of bytes are read or fails;
use that helper for both the 9-byte Auth frame and the 6-byte ReadyForQuery
frame (asserting the return equals the expected length) before checking
buffer[0] for 'R' and 'Z', and keep use of STARTUP_PKT_LEN and buffer unchanged.
---
Nitpick comments:
In `@tests/server_tests.cpp`:
- Line 65: The socket() call that assigns to variable 'sock' should be checked
for failure (socket() returns -1) — after int sock = socket(AF_INET,
SOCK_STREAM, 0); add a check for sock == -1 and handle the error (e.g., call
perror or log the error and fail the test / return early) to avoid misleading
test failures; ensure any allocated resources are cleaned up before exiting the
test.
- Line 106: The call to send(sock, startup.data(), startup.size() * 4, 0)
discards its return value so partial sends or errors go unnoticed; replace it
with a send-loop that tracks bytes_sent versus total_len (total_len =
startup.size() * 4), repeatedly calling send on the buffer pointer
(startup.data() + bytes_sent) until bytes_sent == total_len, and on each
iteration handle -1 return values by checking errno for EINTR (retry) or other
errors (fail the test and log/ASSERT with errno), referencing the existing
variables sock and startup and the send call to locate where to add the loop and
error handling.
- Line 96: The socket creation call int sock = socket(AF_INET, SOCK_STREAM, 0);
does not check for failure; update the test to verify the returned descriptor is
valid (i.e., not -1) immediately after socket() (using the test framework's
assert/expect) and handle the error path (fail the test or return) so subsequent
calls don't use an invalid sock variable; reference the 'sock' variable and the
socket(...) call in tests/server_tests.cpp when adding the check.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- include/distributed/distributed_executor.hpp
- include/network/rpc_message.hpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- tests/server_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- include/distributed/distributed_executor.hpp
Actionable comments posted: 2
♻️ Duplicate comments (5)
src/main.cpp (3)
271-283: ⚠️ Potential issue | 🟠 Major

2PC handlers still acknowledge success for unknown transactions.

Lines 280-283: TxnPrepare does not validate/prepare by txn_id, and Lines 307-311 plus Lines 335-339 return success when get_transaction(...) is null. This breaks 2PC correctness.

✅ Suggested fix

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnPrepare,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
-     (void)args;
      cloudsql::network::QueryResultsReply reply;
      try {
-       // In a full implementation, we'd find the txn by ID and flush its WAL.
-       // For now, we just force a flush.
-       log_manager->flush(true);
-       reply.success = true;
+       auto txn = transaction_manager.get_transaction(args.txn_id);
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.prepare(txn);
+         log_manager->flush(true);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }
@@
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
-       if (txn) {
-         transaction_manager.commit(txn);
-       }
-       reply.success = true;
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.commit(txn);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {
@@
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
-       if (txn) {
-         transaction_manager.abort(txn);
-       }
-       reply.success = true;
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.abort(txn);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {

Also applies to: 299-312, 327-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 271 - 283, The TxnPrepare RPC handler (RpcType::TxnPrepare) and the other 2PC handlers must not acknowledge success for unknown transactions: fetch and validate the transaction by txn_id (using get_transaction(...)) before flushing/applying any work, and if get_transaction(...) returns null/none return a failure reply (set reply.success = false and include a descriptive error/errno in the reply) instead of proceeding; update the handlers referenced (the TxnPrepare lambda and the other 2PC lambdas around where get_transaction(...) is used) to early-return on a null transaction and only perform log_manager->flush or other commit/prepare actions when the transaction exists.
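The guard the review asks for can be shown in isolation. The sketch below models the prepare-handler logic only — `TxnManager`, `Reply`, and `handle_prepare` are hypothetical stand-ins, not the project's API — to illustrate that a participant must vote NO for a `txn_id` it does not know, and only prepare (and, in the real system, flush the WAL for) transactions it can find:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical minimal model of the reviewed handler; names are illustrative.
struct Reply {
    bool success = false;
    std::string error_msg;
};

class TxnManager {
public:
    void begin(uint64_t id) { txns_[id] = false; }
    bool exists(uint64_t id) const { return txns_.count(id) != 0; }
    void prepare(uint64_t id) { txns_[id] = true; }
    bool prepared(uint64_t id) const {
        auto it = txns_.find(id);
        return it != txns_.end() && it->second;
    }
private:
    std::unordered_map<uint64_t, bool> txns_;  // txn id -> prepared flag
};

// A prepare vote must be NO for an unknown txn_id; only a known transaction
// is marked prepared (a stand-in for prepare + WAL flush) before voting YES.
Reply handle_prepare(TxnManager& tm, uint64_t txn_id) {
    Reply reply;
    if (!tm.exists(txn_id)) {
        reply.success = false;
        reply.error_msg = "Unknown transaction id";
        return reply;
    }
    tm.prepare(txn_id);
    reply.success = true;
    return reply;
}
```

The key property: a coordinator that receives this NO vote aborts the global transaction instead of committing on a participant that never did the work.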
150-158: ⚠️ Potential issue | 🟠 Major — Reject invalid `--mode` values instead of silently falling back to Standalone.
Line 156 currently maps any unknown mode string to `RunMode::Standalone`, which can hide operator typos and boot the wrong role.
✅ Suggested fix
 } else if ((arg == "-m" || arg == "--mode") && i + 1 < cmd_args.size()) {
     const std::string& mode_str = cmd_args[++i];
     if (mode_str == "coordinator" || mode_str == "distributed") {
         config.mode = cloudsql::config::RunMode::Coordinator;
     } else if (mode_str == "data") {
         config.mode = cloudsql::config::RunMode::Data;
+    } else if (mode_str == "standalone") {
+        config.mode = cloudsql::config::RunMode::Standalone;
     } else {
-        config.mode = cloudsql::config::RunMode::Standalone;
+        std::cerr << "Invalid mode: " << mode_str
+                  << " (expected: standalone|coordinator|data)\n";
+        return 1;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 150 - 158, The command-line parsing currently maps any unrecognized --mode value to cloudsql::config::RunMode::Standalone; change this to treat unknown values as an error: in the arg-handling block that reads mode_str (the branch that sets config.mode), validate that mode_str is exactly "coordinator"/"distributed" or "data" and if not, emit a clear error message (including the invalid mode), print usage/help, and exit with a non-zero status instead of assigning RunMode::Standalone; update any tests or callers expecting the old fallback behavior if necessary.
260-267: ⚠️ Potential issue | 🔴 Critical — Harden RPC writes: handle partial sends, SIGPIPE/EPIPE, and payload length overflow.
Lines 263/292/320/348 narrow the payload size to `uint16_t` without bounds checks, and lines 266/267 etc. ignore `send()` return values. This can truncate replies or corrupt frames under partial writes/disconnects.
✅ Suggested fix direction
+#include <cerrno>
+#include <limits>
@@
+auto send_all = [](int fd, const uint8_t* data, size_t len) -> bool {
+  size_t off = 0;
+  while (off < len) {
+    ssize_t n = send(fd, data + off, len - off, MSG_NOSIGNAL);
+    if (n < 0 && errno == EINTR) continue;
+    if (n <= 0) return false;
+    off += static_cast<size_t>(n);
+  }
+  return true;
+};
@@
-  resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
+  if (resp_p.size() > std::numeric_limits<uint16_t>::max()) {
+    return; // or send explicit protocol error
+  }
+  resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
   char h_buf[8];
   resp_h.encode(h_buf);
-  send(fd, h_buf, 8, 0);
-  send(fd, resp_p.data(), resp_p.size(), 0);
+  if (!send_all(fd, reinterpret_cast<const uint8_t*>(h_buf), sizeof(h_buf))) return;
+  if (!send_all(fd, resp_p.data(), resp_p.size())) return;

#!/bin/bash
set -euo pipefail
echo "== payload_len narrow casts in src/main.cpp =="
rg -n -C2 'payload_len\s*=\s*static_cast<uint16_t>\s*\(' src/main.cpp
echo
echo "== raw send() callsites in src/main.cpp =="
rg -n -C2 '\bsend\s*\(' src/main.cpp
echo
echo "== SIGPIPE / no-signal protections in src/main.cpp =="
rg -n -C2 'MSG_NOSIGNAL|SO_NOSIGPIPE|SIGPIPE' src/main.cpp

Also applies to: 289-296, 317-324, 345-352
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 260 - 267, The response write currently narrows resp_p.size() into resp_h.payload_len (uint16_t) and performs raw send(fd, ...) calls without checking return values or handling partial writes or SIGPIPE/EPIPE; to fix, validate that resp_p.size() fits in uint16_t before the static_cast and return/handle an error if it overflows, then replace the two raw send() calls around resp_h.encode(...) and resp_p.data() with a robust write loop (e.g., a send_all helper used by the code) that checks send()'s return value, handles short writes by advancing the buffer and retrying until all bytes are sent, and detects EPIPE or other fatal errors to stop and close the fd; also suppress SIGPIPE for these sends (use MSG_NOSIGNAL on send or ensure SIGPIPE is ignored) and surface/log errors where appropriate (refer to resp_p, resp_h, payload_len, encode, and the send(fd, ...) callsites).
include/network/rpc_message.hpp (2)
150-160: ⚠️ Potential issue | 🟠 Major — Normalize `txn_id` to network byte order in `TxnOperationArgs`.
Lines 152 and 159 copy a host-endian `uint64_t` directly, so `txn_id` can be decoded incorrectly across mixed-endian nodes.
Proposed fix
 [[nodiscard]] std::vector<uint8_t> serialize() const {
     std::vector<uint8_t> out(8);
-    std::memcpy(out.data(), &txn_id, 8);
+    out[0] = static_cast<uint8_t>((txn_id >> 56) & 0xFF);
+    out[1] = static_cast<uint8_t>((txn_id >> 48) & 0xFF);
+    out[2] = static_cast<uint8_t>((txn_id >> 40) & 0xFF);
+    out[3] = static_cast<uint8_t>((txn_id >> 32) & 0xFF);
+    out[4] = static_cast<uint8_t>((txn_id >> 24) & 0xFF);
+    out[5] = static_cast<uint8_t>((txn_id >> 16) & 0xFF);
+    out[6] = static_cast<uint8_t>((txn_id >> 8) & 0xFF);
+    out[7] = static_cast<uint8_t>(txn_id & 0xFF);
     return out;
 }
 static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) {
     TxnOperationArgs args;
     if (in.size() >= 8) {
-        std::memcpy(&args.txn_id, in.data(), 8);
+        args.txn_id = (static_cast<uint64_t>(in[0]) << 56) |
+                      (static_cast<uint64_t>(in[1]) << 48) |
+                      (static_cast<uint64_t>(in[2]) << 40) |
+                      (static_cast<uint64_t>(in[3]) << 32) |
+                      (static_cast<uint64_t>(in[4]) << 24) |
+                      (static_cast<uint64_t>(in[5]) << 16) |
+                      (static_cast<uint64_t>(in[6]) << 8) |
+                      static_cast<uint64_t>(in[7]);
     }
     return args;
 }

#!/bin/bash
# Verify all RPC multi-byte fields avoid raw host-endian memcpy in this header.
rg -nP 'std::memcpy\([^;]*&\s*(txn_id|err_len|row_count|magic|payload_len)\s*,\s*(4|8)\)' include/network/rpc_message.hpp -C2
rg -nP '\b(htonl|htons|ntohl|ntohs)\b' include/network/rpc_message.hpp -C2
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 150 - 160, The serialize/deserialize for TxnOperationArgs currently memcpy host-endian txn_id directly; change both TxnOperationArgs::serialize and TxnOperationArgs::deserialize to convert txn_id to a fixed network byte order before writing and convert back after reading (e.g., use a portable 64-bit host-to-network and network-to-host conversion like htobe64/be64toh or an inline byte-swap/shift helper) so the serialized bytes are consistent across endianness and then memcpy or assign those bytes; update only the logic around txn_id in serialize() and deserialize() to perform the conversion.
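The shift-based conversion the review suggests is endianness-independent by construction, since shifting operates on values rather than memory layout. A compact, self-contained version of that idea (helper names `pack_be64`/`unpack_be64` are illustrative, not the project's API):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Portable big-endian (network order) packing for a 64-bit id. Written with
// shifts on the value, so the byte layout is the same on any host endianness.
std::array<uint8_t, 8> pack_be64(uint64_t v) {
    std::array<uint8_t, 8> out{};
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<uint8_t>((v >> (56 - 8 * i)) & 0xFF);  // MSB first
    return out;
}

uint64_t unpack_be64(const std::array<uint8_t, 8>& in) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | in[i];  // fold bytes back in, most significant first
    return v;
}
```

On glibc/BSD systems `htobe64`/`be64toh` achieve the same thing, but the shift form avoids depending on a non-standard header.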
98-113: ⚠️ Potential issue | 🔴 Critical — Fix wire-format safety in `QueryResultsReply` (bounds + byte order).
Line 129 reads 4 bytes before verifying `in.size() >= 5`, which is an OOB risk on short payloads. Also, lines 105/112/129/136 currently serialize/deserialize multi-byte integers in host byte order, which breaks cross-architecture RPC compatibility.
Proposed fix
 [[nodiscard]] std::vector<uint8_t> serialize() const {
     std::vector<uint8_t> out;
     out.push_back(success ? 1 : 0);
     uint32_t err_len = static_cast<uint32_t>(error_msg.size());
+    uint32_t n_err_len = htonl(err_len);
     size_t offset = out.size();
     out.resize(offset + 4 + err_len);
-    std::memcpy(out.data() + offset, &err_len, 4);
+    std::memcpy(out.data() + offset, &n_err_len, 4);
     std::memcpy(out.data() + offset + 4, error_msg.data(), err_len);
     // Simplified row count serialization
     uint32_t row_count = static_cast<uint32_t>(rows.size());
+    uint32_t n_row_count = htonl(row_count);
     offset = out.size();
     out.resize(offset + 4);
-    std::memcpy(out.data() + offset, &row_count, 4);
+    std::memcpy(out.data() + offset, &n_row_count, 4);
     return out;
 }
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
     QueryResultsReply reply;
-    if (in.empty()) {
+    if (in.size() < 5) { // success(1) + err_len(4)
         return reply;
     }
     reply.success = in[0] != 0;
-    uint32_t err_len = 0;
-    std::memcpy(&err_len, in.data() + 1, 4);
-    if (in.size() >= 5 + err_len) {
-        reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
-    }
+    uint32_t err_len_net = 0;
+    std::memcpy(&err_len_net, in.data() + 1, 4);
+    const uint32_t err_len = ntohl(err_len_net);
+    if (in.size() < 5ULL + err_len) return reply;
+    reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
-    uint32_t row_count = 0;
-    if (in.size() >= 9 + err_len) {
-        std::memcpy(&row_count, in.data() + 5 + err_len, 4);
-        reply.rows.resize(row_count); // Placeholders
-    }
+    if (in.size() < 9ULL + err_len) return reply;
+    uint32_t row_count_net = 0;
+    std::memcpy(&row_count_net, in.data() + 5 + err_len, 4);
+    const uint32_t row_count = ntohl(row_count_net);
+    reply.rows.resize(row_count); // Placeholders
     return reply;
 }
Also applies to: 120-137
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 98 - 113, The serialize/deserialize for QueryResultsReply (serialize() and the corresponding deserialization code) is unsafe: it reads fixed-size fields without verifying buffer length and uses host byte order for multi-byte integers; fix by validating buffer size before every read (ensure at least 1 + 4 + err_len + 4 + ... bytes exist before memcpy/reads) and convert all multi-byte integers to/from network byte order (use htonl on serialize for uint32_t fields like err_len and row_count and ntohl on deserialize) so the wire format is consistent across architectures; apply the same bounds checks and byte-order conversions to the other affected functions mentioned (the deserialize counterpart and any code around lines handling err_len/row_count).
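The "validate length before every read" discipline generalizes: check the fixed prefix first, then check the variable-length part against the declared length before touching it. A self-contained sketch of that pattern for the success/err_len/error_msg prefix described above (`ParsedReply` and `parse_reply` are illustrative names; byte order is fixed big-endian via shifts):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative bounds-checked parse; not the project's actual deserializer.
struct ParsedReply {
    bool ok = false;       // whether the buffer parsed cleanly
    bool success = false;
    std::string error_msg;
};

ParsedReply parse_reply(const std::vector<uint8_t>& in) {
    ParsedReply r;
    if (in.size() < 5) return r;               // need success(1) + err_len(4)
    r.success = in[0] != 0;
    uint32_t err_len = (uint32_t(in[1]) << 24) | (uint32_t(in[2]) << 16) |
                       (uint32_t(in[3]) << 8)  |  uint32_t(in[4]);
    if (in.size() < 5ULL + err_len) return r;  // reject truncated payloads
    r.error_msg.assign(reinterpret_cast<const char*>(in.data() + 5), err_len);
    r.ok = true;
    return r;
}
```

Note the `5ULL + err_len` comparison: widening to 64 bits before adding prevents the length check itself from overflowing on a hostile 32-bit `err_len`.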
static RpcHeader decode(const char* in) {
    RpcHeader h;
    uint32_t n_magic = 0;
    uint16_t n_len = 0;
    std::memcpy(&n_magic, in, 4);
    h.magic = ntohl(n_magic);
    h.type = static_cast<RpcType>(static_cast<uint8_t>(in[4]));
    h.flags = static_cast<uint8_t>(in[5]);
    std::memcpy(&n_len, in + 6, 2);
    h.payload_len = ntohs(n_len);
    return h;
🧩 Analysis chain
🏁 Script executed:
# First, find the RpcHeader class and decode method
find . -type f -name "*.hpp" -o -name "*.h" -o -name "*.cpp" | head -20
🏁 Script executed:
# Search for RpcHeader::decode calls
rg -n 'RpcHeader::decode' --type cpp --type hpp
🏁 Script executed:
# Look for HEADER_SIZE definition
rg -n 'HEADER_SIZE' --type cpp --type hpp
🏁 Script executed:
# Find the actual decode function implementation
rg -n 'static.*RpcHeader.*decode' -A 15 --type cpp --type hpp
🏁 Script executed:
# Search for decode calls without type filter
rg -n 'RpcHeader::decode'
🏁 Script executed:
# Search for HEADER_SIZE
rg -n 'HEADER_SIZE'
🏁 Script executed:
# Read the rpc_message.hpp file to see the implementation
cat -n include/network/rpc_message.hpp
🏁 Script executed:
# Look for decode() function calls more broadly
rg -n '\.decode\(' -A 2 -B 2
🏁 Script executed:
# Examine the rpc_server.cpp call site with context
sed -n '95,115p' src/network/rpc_server.cpp
🏁 Script executed:
# Examine the rpc_client.cpp call site with context
sed -n '65,85p' src/network/rpc_client.cpp
🏁 Script executed:
# Look for header_buf declaration and initialization in rpc_server.cpp
sed -n '80,110p' src/network/rpc_server.cpp
🏁 Script executed:
# Look for header_buf declaration and initialization in rpc_client.cpp
sed -n '50,80p' src/network/rpc_client.cpp
Harden RpcHeader::decode API with input-length validation.
decode(const char* in) reads 8 bytes unconditionally. Both call sites in rpc_server.cpp (line 104) and rpc_client.cpp (line 74) use recv() to fill an 8-byte buffer but only check if recv() returned ≤ 0; they do not validate that exactly 8 bytes were received. If recv() returns fewer than 8 bytes (a valid partial read), decode() will read uninitialized memory from the buffer, causing an OOB read and undefined behavior.
Proposed fix
- static RpcHeader decode(const char* in) {
- RpcHeader h;
+ static bool decode(const char* in, size_t in_len, RpcHeader& h) {
+ if (in == nullptr || in_len < HEADER_SIZE) {
+ return false;
+ }
uint32_t n_magic = 0;
uint16_t n_len = 0;
std::memcpy(&n_magic, in, 4);
h.magic = ntohl(n_magic);
h.type = static_cast<RpcType>(static_cast<uint8_t>(in[4]));
h.flags = static_cast<uint8_t>(in[5]);
std::memcpy(&n_len, in + 6, 2);
h.payload_len = ntohs(n_len);
- return h;
+ return true;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 57 - 67, The RpcHeader::decode
function currently assumes an 8-byte buffer; change its API to accept a buffer
length (e.g., decode(const char* in, size_t len)) and validate len >= 8 before
reading, returning an error (throw, optional, or sentinel RpcHeader) on short
input; update all call sites (rpc_server.cpp and rpc_client.cpp) that call
RpcHeader::decode to pass the actual bytes received and handle the error path
when fewer than 8 bytes were returned by recv() so you never memcpy/read
uninitialized/partial data.
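The suggested length-checked API can be exercised in isolation. The sketch below mirrors the 8-byte wire layout from the diff (magic:4, type:1, flags:1, payload_len:2, big-endian) but uses stand-alone names — `Header`, `decode_header`, `kHeaderSize` are illustrative, not the project's types:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative length-validated header decode; rejects short/partial reads
// instead of reading past the buffer.
struct Header {
    uint32_t magic = 0;
    uint8_t type = 0;
    uint8_t flags = 0;
    uint16_t payload_len = 0;
};

constexpr size_t kHeaderSize = 8;

bool decode_header(const uint8_t* in, size_t len, Header& h) {
    if (in == nullptr || len < kHeaderSize) return false;  // partial recv()
    h.magic = (uint32_t(in[0]) << 24) | (uint32_t(in[1]) << 16) |
              (uint32_t(in[2]) << 8)  |  uint32_t(in[3]);
    h.type = in[4];
    h.flags = in[5];
    h.payload_len = static_cast<uint16_t>((uint16_t(in[6]) << 8) | in[7]);
    return true;
}
```

Callers that loop on `recv()` should accumulate until `kHeaderSize` bytes have arrived and only then call `decode_header`; the `false` return makes the short-read path impossible to ignore silently.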
uint32_t row_count = 0;
if (in.size() >= 9 + err_len) {
    std::memcpy(&row_count, in.data() + 5 + err_len, 4);
    reply.rows.resize(row_count); // Placeholders
}
Bound row_count before resizing to prevent memory exhaustion.
Line 137 trusts untrusted wire data and does rows.resize(row_count) directly. A malicious payload can force excessive allocation and crash the process.
Proposed fix
uint32_t row_count = 0;
if (in.size() >= 9 + err_len) {
std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+ constexpr uint32_t kMaxPlaceholderRows = 100000;
+ if (row_count > kMaxPlaceholderRows) {
+ reply.success = false;
+ reply.error_msg = "row_count exceeds protocol limit";
+ return reply;
+ }
reply.rows.resize(row_count); // Placeholders
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 134 - 138, The code unsafely
calls reply.rows.resize(row_count) with untrusted row_count from the wire;
before resizing, validate row_count against a safe maximum and ensure the buffer
contains enough bytes for that many rows to avoid integer overflow and excessive
allocation. Concretely, in the block that reads row_count (using variables
row_count, err_len, in, and reply.rows.resize), add checks: cap row_count to a
defined MAX_ROWS constant, verify that multiplying row_count by the per-row
serialized size cannot overflow and that in.size() is large enough to contain
the expected row data, and only then call reply.rows.resize(row_count); if any
check fails, treat the message as malformed and return an error.
This PR implements the Phase 4 distributed operations:
Summary by CodeRabbit
New Features
Tests