Phase 4: Shard-Aware Routing and Hardened 2PC (#4)
📝 Walkthrough
Adds distributed functionality: a Raft consensus node, RPC client/server, cluster manager, shard routing, a distributed executor with 2PC, and related Catalog/Transaction changes, plus tests and CMake test target updates.
Sequence Diagrams
sequenceDiagram
participant Client
participant Coordinator as Coordinator Server
participant DistExec as DistributedExecutor
participant DataNode1 as Data Node 1
participant DataNode2 as Data Node 2
Client->>Coordinator: COMMIT statement
Coordinator->>DistExec: execute(COMMIT)
Note over DistExec: Phase 1 - Prepare
DistExec->>DataNode1: TxnPrepare RPC
DistExec->>DataNode2: TxnPrepare RPC
DataNode1-->>DistExec: PREPARED
DataNode2-->>DistExec: PREPARED
alt All prepared
Note over DistExec: Phase 2 - Commit
DistExec->>DataNode1: TxnCommit RPC
DistExec->>DataNode2: TxnCommit RPC
DataNode1-->>DistExec: COMMITTED
DataNode2-->>DistExec: COMMITTED
DistExec-->>Coordinator: success
else Prepare failed
Note over DistExec: Phase 2 - Abort
DistExec->>DataNode1: TxnAbort RPC
DistExec->>DataNode2: TxnAbort RPC
DataNode1-->>DistExec: ABORTED
DataNode2-->>DistExec: ABORTED
DistExec-->>Coordinator: error
end
Coordinator-->>Client: result
sequenceDiagram
participant Follower as RaftNode (candidate/follower)
participant Peer1 as RaftNode 2
participant Peer2 as RaftNode 3
Note over Follower: Election timeout
Follower->>Follower: increment term, vote self
Follower->>Peer1: RequestVote RPC
Follower->>Peer2: RequestVote RPC
Peer1-->>Follower: vote granted
Peer2-->>Follower: vote granted
Note over Follower: majority -> become Leader
loop Heartbeat interval (Leader)
Follower->>Peer1: AppendEntries (heartbeat)
Follower->>Peer2: AppendEntries (heartbeat)
Peer1-->>Follower: ACK
Peer2-->>Follower: ACK
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/recovery/log_record.hpp (1)
26-37: ⚠️ Potential issue | 🟠 Major — Prevent WAL deserialization corruption from enum reordering.
Adding PREPARE between BEGIN and COMMIT shifts the numeric values of COMMIT (7→8), ABORT (8→9), and NEW_PAGE (9→10). Since serialization uses raw `memcpy` without versioning, old WAL logs with byte value 7 will deserialize as PREPARE instead of COMMIT during recovery, breaking correctness. Assign explicit numeric IDs to all enum members to keep values stable across additions:
Required fix
enum class LogRecordType : uint8_t { INVALID = 0, INSERT = 1, MARK_DELETE = 2, APPLY_DELETE = 3, ROLLBACK_DELETE = 4, UPDATE = 5, BEGIN = 6, COMMIT = 7, ABORT = 8, NEW_PAGE = 9, PREPARE = 10 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/recovery/log_record.hpp` around lines 26 - 37, The LogRecordType enum must use stable explicit numeric values to avoid WAL deserialization corruption when new members are inserted; update the enum LogRecordType by assigning fixed integer values to every member (e.g., INVALID = 0, INSERT = 1, MARK_DELETE = 2, ..., NEW_PAGE = N) so existing on-disk bytes map to the intended meanings, and ensure any code that serializes/deserializes LogRecordType (uses memcpy or reads raw bytes) continues to operate on these explicit values; keep the same symbolic names (INSERT, MARK_DELETE, APPLY_DELETE, ROLLBACK_DELETE, UPDATE, BEGIN, PREPARE, COMMIT, ABORT, NEW_PAGE) but give each an explicit numeric initializer to lock their numeric IDs.
🟠 Major comments (14)
include/common/cluster_manager.hpp-59-64 (1)
59-64: ⚠️ Potential issue | 🟠 Major — Node liveness never expires, so stale nodes stay routable.
`last_heartbeat` is tracked, but Lines 70-92 only check `is_active`, and this flag is only set to `true` in `heartbeat()` (Lines 62-63). Without timeout-based deactivation, failed nodes remain eligible for routing and 2PC fanout indefinitely.
Also applies to: 70-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/common/cluster_manager.hpp` around lines 59 - 64, The code sets nodes_[id].is_active = true in heartbeat() but never expires nodes so stale nodes remain routable; update the liveness logic to use last_heartbeat as the source of truth: introduce a configurable timeout (e.g., heartbeat_timeout) and, wherever node liveness is checked (the code paths referencing is_active between the checks around lines 70-92), replace or augment the is_active check with a time-based check that compares std::chrono::system_clock::now() - nodes_[id].last_heartbeat > heartbeat_timeout and sets nodes_[id].is_active = false (under mutex_), or simply treat node as inactive when the delta exceeds the timeout; ensure modifications reference heartbeat(), nodes_, last_heartbeat, is_active and mutex_ so stale nodes are deactivated promptly.
include/common/cluster_manager.hpp-36-44 (1)
36-44: ⚠️ Potential issue | 🟠 Major — Constructor intent and behavior diverge: self node is never registered.
Line 37 says “Add self to node map”, but `nodes_` is never updated in this branch. As a result, `get_data_nodes()` / `get_coordinators()` won’t include the local node unless another path registers it.
🔧 Proposed fix
explicit ClusterManager(const config::Config* config) : config_(config) { // Add self to node map if in distributed mode if (config_ != nullptr && config_->mode != config::RunMode::Standalone) { self_node_.id = "local_node"; // Will be replaced by unique ID later self_node_.address = "127.0.0.1"; self_node_.cluster_port = config_->cluster_port; self_node_.role = config_->mode; self_node_.last_heartbeat = std::chrono::system_clock::now(); + nodes_.emplace(self_node_.id, self_node_); } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/common/cluster_manager.hpp` around lines 36 - 44, The constructor for ClusterManager sets up self_node_ when not in Standalone mode but never inserts it into the nodes_ map, so local node lookups (e.g., get_data_nodes(), get_coordinators()) miss it; fix by adding self_node_ to nodes_ after it's populated (use nodes_.emplace(self_node_.id, self_node_) or nodes_[self_node_.id] = self_node_) inside the same branch of explicit ClusterManager(const config::Config* config) so the local node is registered immediately (still leave the placeholder id "local_node" to be replaced later).
include/catalog/catalog.hpp-181-195 (1)
181-195: 🛠️ Refactor suggestion | 🟠 Major — Restrict local-only DDL APIs to prevent accidental consensus bypass.
`create_table_local` / `drop_table_local` are public at Lines 184 and 194. That exposes direct local mutation paths that can bypass the Raft apply flow and risk catalog divergence.
🔧 Proposed hardening
class Catalog { public: @@ - oid_t create_table_local(const std::string& table_name, std::vector<ColumnInfo> columns); @@ - bool drop_table_local(oid_t table_id); @@ private: + friend class raft::RaftNode; + oid_t create_table_local(const std::string& table_name, std::vector<ColumnInfo> columns); + bool drop_table_local(oid_t table_id);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/catalog/catalog.hpp` around lines 181 - 195, The public API exposes local-only DDL functions create_table_local and drop_table_local which can be accidentally invoked and bypass Raft; change their visibility to non-public by moving create_table_local and drop_table_local into the class's private (or protected) section in catalog.hpp (or mark them as internal with an access specifier) and add a clear comment above each explaining they are Raft-only internals; ensure any callers outside the class are updated to use the public create_table/drop_table RPC paths or friend accessors so no external code depends on these methods being public.
src/common/config.cpp-56-57 (1)
56-57: ⚠️ Potential issue | 🟠 Major — `cluster_port` and `port` parsing accept out-of-range values via wraparound.
At lines 55 and 57, casting `std::stoi(value)` to `uint16_t` before bounds validation causes invalid inputs (e.g., 70000) to silently wrap to valid-looking values (e.g., 4464). The validation at line 135 checking `cluster_port > MAX_PORT` is ineffective because `cluster_port` is already `uint16_t`, so this condition can never be true. Validate as `int` before casting to `uint16_t`, and reject values outside the valid port range (1–65535).
🔧 Proposed fix
if (key == "port") { - port = static_cast<uint16_t>(std::stoi(value)); + int parsed_port = 0; + try { + parsed_port = std::stoi(value); + } catch (const std::exception&) { + std::cerr << "Invalid port: " << value << "\n"; + return false; + } + if (parsed_port <= 0 || + parsed_port > static_cast<int>(MAX_PORT)) { + std::cerr << "Invalid port: " << value << "\n"; + return false; + } + port = static_cast<uint16_t>(parsed_port); } else if (key == "cluster_port") { - cluster_port = static_cast<uint16_t>(std::stoi(value)); + int parsed_cluster_port = 0; + try { + parsed_cluster_port = std::stoi(value); + } catch (const std::exception&) { + std::cerr << "Invalid cluster_port: " << value << "\n"; + return false; + } + if (parsed_cluster_port <= 0 || + parsed_cluster_port > static_cast<int>(MAX_PORT)) { + std::cerr << "Invalid cluster_port: " << value << "\n"; + return false; + } + cluster_port = static_cast<uint16_t>(parsed_cluster_port); } @@ if (port == 0 || port > MAX_PORT) { - std::cerr << "Invalid port number: " << port << "\n"; - return false; - } - - if (cluster_port == 0 || cluster_port > MAX_PORT) { std::cerr << "Invalid port number: " << port << "\n"; return false; } + + if (cluster_port == 0) { + std::cerr << "Invalid cluster port number: " << cluster_port << "\n"; + return false; + }Also affects: line 55 (
`port` parsing)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/common/config.cpp` around lines 56 - 57, The port parsing for both port and cluster_port currently casts std::stoi(value) directly to uint16_t (symbols: port, cluster_port), allowing out-of-range inputs to wrap; change parsing to read into a signed integer first (e.g., int parsed = std::stoi(value)), validate parsed is within 1..65535 (compare against MIN_PORT/ MAX_PORT or literal bounds) and only then assign static_cast<uint16_t>(parsed); update or remove the existing ineffective check that compares cluster_port (uint16_t) to MAX_PORT so validation occurs on the int before casting.
include/distributed/shard_manager.hpp-27-27 (1)
27-27: ⚠️ Potential issue | 🟠 Major — Fail fast when shard count is zero.
Line 27 returning shard `0` for `num_shards == 0` hides invalid topology and can silently misroute operations.
Proposed fix
- if (num_shards == 0) return 0; + if (num_shards == 0) { + throw std::invalid_argument("num_shards must be > 0"); + }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` at line 27, The function currently returns 0 when num_shards == 0 which hides an invalid topology; instead detect the zero-shard condition (the if (num_shards == 0) branch) and fail fast by raising a clear error (e.g., throw std::invalid_argument or std::runtime_error with a descriptive message) or using an assert, so callers do not silently get shard 0; update the branch that currently returns 0 to throw/include the chosen failure mechanism and ensure callers that call this function/documentation handle the exception.
src/network/rpc_client.cpp-84-84 (1)
84-84: ⚠️ Potential issue | 🟠 Major — Guard against payload length truncation in RPC header.
Line 84 casts `payload.size()` to `uint16_t` without validation. When payloads exceed 65535 bytes, the header stores a truncated value while the full payload is sent on line 93. The receiver (in `call()`) reads this truncated size from the header and only receives that many bytes, leaving excess data in the stream and desynchronizing all subsequent message framing.
Proposed fix
+#include <limits> ... - header.payload_len = static_cast<uint16_t>(payload.size()); + if (payload.size() > std::numeric_limits<uint16_t>::max()) { + return false; + } + header.payload_len = static_cast<uint16_t>(payload.size());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` at line 84, The header.payload_len is being assigned via static_cast<uint16_t>(payload.size()) which truncates lengths > 65535 and desynchronizes framing; update the sending code that sets header.payload_len (the statement assigning header.payload_len from payload.size()) to first validate payload.size() <= std::numeric_limits<uint16_t>::max() and handle violations by returning an error/throwing an exception or rejecting the RPC request with a clear error message; also ensure callers (e.g., the function that sends the header/payload and the receiver call() that reads header.payload_len) are left consistent by not sending the payload when the size check fails.
src/distributed/distributed_executor.cpp-47-49 (1)
47-49: ⚠️ Potential issue | 🟠 Major — Hard-coded global transaction ID will collide across concurrent sessions.
Using `GLOBAL_TXN_ID = 1` causes cross-session transaction aliasing and breaks 2PC isolation/correctness under concurrent clients.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 47 - 49, The hard-coded constexpr GLOBAL_TXN_ID = 1 will collide across concurrent sessions; replace this static constant with a thread-safe, per-session transaction ID generator (e.g., a std::atomic<uint64_t> global counter or a UUID-based generator) and assign a unique txn id when creating the executor/transaction (look for uses in distributed_executor.cpp and the DistributedExecutor/TransactionManager constructors). Ensure the chosen generator is initialized once and accessed atomically (or use a session-specific counter combined with a unique session id) so each DistributedExecutor instance gets a unique txn id instead of reusing GLOBAL_TXN_ID.
src/catalog/catalog.cpp-76-82 (1)
76-82: ⚠️ Potential issue | 🟠 Major — Raft-enabled DDL path currently mutates local state without replication.
When `raft_node_` is set, both methods still call local mutation directly. In multi-node mode this can diverge catalog metadata across nodes.
💡 Safer interim behavior (fail fast until Raft replication is wired)
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) { if (raft_node_ != nullptr) { - /* TODO: Serialize DDL and replicate via Raft */ - /* For now, just call local to keep it working during Step 4 implementation */ - return create_table_local(table_name, std::move(columns)); + throw std::runtime_error("create_table via Raft is not implemented"); } return create_table_local(table_name, std::move(columns)); } @@ bool Catalog::drop_table(oid_t table_id) { if (raft_node_ != nullptr) { - /* TODO: Replicate via Raft */ - return drop_table_local(table_id); + throw std::runtime_error("drop_table via Raft is not implemented"); } return drop_table_local(table_id); }Also applies to: 112-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/catalog/catalog.cpp` around lines 76 - 82, When raft_node_ is non-null the code currently calls create_table_local (mutating local state without Raft replication); change the branch in the create_table path so that when raft_node_ != nullptr it fails fast with a clear error/Status (e.g., return an error result or throw an exception indicating "DDL replication via Raft not implemented") instead of calling create_table_local, and apply the same change pattern to the similar DDL paths (the other branch around lines 112-117) so no local mutation occurs when raft_node_ is set.
src/distributed/distributed_executor.cpp-88-105 (1)
88-105: ⚠️ Potential issue | 🟠 Major — Phase-2 commit failures are ignored, but success is still returned.
`client.call(...)` and the reply status in phase 2 are discarded. A node-level commit failure can still produce an overall success response.
🧭 Suggested result aggregation for phase 2
- std::vector<std::future<void>> phase2_futures; + std::vector<std::future<std::pair<bool, std::string>>> phase2_futures; for (const auto& node : data_nodes) { - phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() { + phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() { network::RpcClient client(node.address, node.cluster_port); if (client.connect()) { std::vector<uint8_t> resp_payload; - static_cast<void>(client.call(phase2_type, payload, resp_payload)); + if (!client.call(phase2_type, payload, resp_payload)) { + return std::make_pair(false, "[" + node.id + "] phase2 RPC failed"); + } + auto reply = network::QueryResultsReply::deserialize(resp_payload); + if (!reply.success) { + return std::make_pair(false, "[" + node.id + "] phase2 failed: " + reply.error_msg); + } + return std::make_pair(true, std::string{}); } + return std::make_pair(false, "[" + node.id + "] phase2 connect failed"); })); } - for (auto& f : phase2_futures) f.get(); + bool phase2_ok = true; + for (auto& f : phase2_futures) { + auto [ok, err] = f.get(); + if (!ok) { phase2_ok = false; errors += err + "; "; } + } - if (all_prepared) { + if (all_prepared && phase2_ok) { return QueryResult(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 88 - 105, The phase-2 async tasks currently discard client.call(...) results and reply status, allowing node-level commit failures to be ignored; modify the lambda launched in phase2_futures to return a result (e.g., bool success and optional error string) instead of void, have it check the return value of network::RpcClient::call(phase2_type, payload, resp_payload) and parse resp_payload/status for errors, and propagate any failures (collect into a vector of failures or combine into errors) when you .get() the futures; finally, if any phase-2 call failed, set QueryResult res with an appropriate "Distributed transaction aborted" message (using the collected error details) rather than returning success, otherwise return success as before (use identifiers: phase2_futures, data_nodes, network::RpcClient, client.call, phase2_type, payload, resp_payload, all_prepared, errors, QueryResult).
include/network/rpc_client.hpp-27-27 (1)
27-27: ⚠️ Potential issue | 🟠 Major — `is_connected()` reads shared state without synchronization.
`fd_` is mutable connection state; reading it lock-free can race with `connect()` / `disconnect()` updates.
🔒 Suggested thread-safe accessor
- bool is_connected() const { return fd_ >= 0; } + bool is_connected() const { + std::scoped_lock<std::mutex> lock(mutex_); + return fd_ >= 0; + }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_client.hpp` at line 27, is_connected() currently reads the mutable connection field fd_ without synchronization and can race with connect()/disconnect() updates; modify is_connected() to acquire the same lock used to protect fd_ (e.g., the connection mutex or std::mutex guarding fd_) and read fd_ under that lock, or make fd_ an atomic<int> and use atomic load semantics if that fits your concurrency model; ensure the accessor uses the chosen synchronization primitive consistently with connect() and disconnect() to avoid data races.
src/distributed/raft_node.cpp-137-140 (1)
137-140: ⚠️ Potential issue | 🟠 Major — `step_down()` called without holding the mutex in `do_candidate()`.
On line 138, `step_down(resp_term)` is called without holding `mutex_`. However, `step_down()` modifies `persistent_state_` and `state_`, which are also accessed under lock elsewhere. This creates a potential data race.
Compare with line 204 in `handle_request_vote()`, where `step_down()` is correctly called with the lock held.
Proposed fix to acquire lock before step_down
if (resp_term > args.term) { + const std::scoped_lock<std::mutex> lock(mutex_); step_down(resp_term); return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 137 - 140, In do_candidate(), calling step_down(resp_term) without holding mutex_ can cause races because step_down() mutates persistent_state_ and state_; change the code in do_candidate() so that you acquire the same lock used elsewhere (mutex_) before invoking step_down(resp_term) — e.g., wrap the step_down call with the scoped lock used in handle_request_vote() (the same mutex_ and locking pattern) so state updates happen under the lock and no other thread can concurrently access persistent_state_ or state_.
include/network/rpc_message.hpp-144-160 (1)
144-160: ⚠️ Potential issue | 🟠 Major — `TxnOperationArgs` also uses host byte order for `txn_id`.
Similar to `QueryResultsReply`, the `txn_id` is serialized in host byte order. For consistency with `RpcHeader` and cross-platform compatibility, use network byte order.
Proposed fix using manual byte order conversion
struct TxnOperationArgs { uint64_t txn_id = 0; [[nodiscard]] std::vector<uint8_t> serialize() const { std::vector<uint8_t> out(8); - std::memcpy(out.data(), &txn_id, 8); + // Manual conversion to network byte order (big-endian) + for (int i = 0; i < 8; ++i) { + out[7 - i] = static_cast<uint8_t>((txn_id >> (i * 8)) & 0xFF); + } return out; } static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) { TxnOperationArgs args; if (in.size() >= 8) { - std::memcpy(&args.txn_id, in.data(), 8); + // Manual conversion from network byte order (big-endian) + args.txn_id = 0; + for (int i = 0; i < 8; ++i) { + args.txn_id = (args.txn_id << 8) | in[i]; + } } return args; } };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 144 - 160, TxnOperationArgs currently serializes txn_id in host byte order; update TxnOperationArgs::serialize and TxnOperationArgs::deserialize to convert txn_id to/from network byte order (big-endian) before copying to/from the byte vector. In serialize(), convert txn_id to a 64-bit network-order value (use a portable ntoh/hton64 helper or explicit byte-shifts) and write those bytes into the output vector; in deserialize(), read the 8 bytes into a 64-bit network-order value and convert it back to host order before assigning args.txn_id. Ensure the helper (e.g., hton64/ntoh64) is used consistently so TxnOperationArgs matches RpcHeader/QueryResultsReply semantics.
include/network/rpc_message.hpp-97-117 (1)
97-117: ⚠️ Potential issue | 🟠 Major — Byte order inconsistency: `QueryResultsReply` uses host byte order.
The `RpcHeader` correctly uses `htonl`/`ntohl` for network byte order, but `QueryResultsReply::serialize()` writes `err_len` and `row_count` directly via `memcpy` in host byte order. This will cause protocol errors when communicating between machines with different endianness.
Proposed fix to use network byte order
[[nodiscard]] std::vector<uint8_t> serialize() const { std::vector<uint8_t> out; out.push_back(success ? 1 : 0); - uint32_t err_len = static_cast<uint32_t>(error_msg.size()); + uint32_t err_len = htonl(static_cast<uint32_t>(error_msg.size())); size_t offset = out.size(); out.resize(offset + 4 + err_len); - std::memcpy(out.data() + offset, &err_len, 4); - std::memcpy(out.data() + offset + 4, error_msg.data(), err_len); + std::memcpy(out.data() + offset, &err_len, 4); // Already in network order + uint32_t raw_err_len = static_cast<uint32_t>(error_msg.size()); + std::memcpy(out.data() + offset + 4, error_msg.data(), raw_err_len); // Simplified row count serialization - uint32_t row_count = static_cast<uint32_t>(rows.size()); + uint32_t row_count = htonl(static_cast<uint32_t>(rows.size())); offset = out.size(); out.resize(offset + 4); std::memcpy(out.data() + offset, &row_count, 4);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 97 - 117, QueryResultsReply::serialize() writes err_len and row_count in host byte order causing endianness bugs; convert those uint32_t values to network byte order with htonl before memcpy (i.e., call htonl(err_len) and htonl(row_count) and memcpy the converted values), and ensure the corresponding reader (QueryResultsReply::deserialize or the code that parses these fields) uses ntohl when loading them back; update the places referencing err_len, row_count, error_msg, rows and follow the RpcHeader pattern of using htonl/ntohl for consistency.
src/distributed/raft_node.cpp-83-96 (1)
83-96: ⚠️ Potential issue | 🟠 Major — Double random timeout generation in `do_follower` creates inconsistent behavior.
Line 84 generates a timeout, then line 88 inside the predicate generates a different random timeout for comparison. This means the condition compares elapsed time against a different value than the wait duration, potentially causing premature or delayed election timeouts.
Proposed fix to use consistent timeout
void RaftNode::do_follower() { const auto timeout = get_random_timeout(); std::unique_lock<std::mutex> lock(mutex_); - if (cv_.wait_for(lock, timeout, [this] { + const auto deadline = std::chrono::system_clock::now() + timeout; + if (cv_.wait_for(lock, timeout, [this, deadline] { return !running_ || - (std::chrono::system_clock::now() - last_heartbeat_ > get_random_timeout()); + (std::chrono::system_clock::now() > deadline); })) { if (!running_) { return; } // Election timeout reached, become candidate state_ = NodeState::Candidate; } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 83 - 96, The follower loop computes a random timeout twice (once in timeout and again inside the wait predicate) causing inconsistent behavior; in RaftNode::do_follower you should compute the random timeout once and reuse it in the cv_.wait_for predicate (e.g., capture the timeout by value or compute a deadline = last_heartbeat_ + timeout and compare now > deadline) so the duration passed to wait_for and the elapsed-time check against last_heartbeat_ use the same value; update the lambda passed to cv_.wait_for to reference that single timeout/deadline and leave remaining logic (running_, state_ transition to Candidate) unchanged.
🟡 Minor comments (8)
include/catalog/catalog.hpp-75-79 (1)
75-79: ⚠️ Potential issue | 🟡 Minor — Default-initialize `ShardInfo` scalar fields.
shard_id) and Line 78 (port) are currently uninitialized on default construction. Initializing them to zero avoids accidental undefined reads in metadata paths.🔧 Proposed fix
struct ShardInfo { - uint32_t shard_id; + uint32_t shard_id = 0; std::string node_address; - uint16_t port; + uint16_t port = 0; };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/catalog/catalog.hpp` around lines 75 - 79, ShardInfo's scalar members shard_id and port are left uninitialized which can cause undefined reads; update the ShardInfo definition so shard_id and port are default-initialized (e.g., add in-class member initializers or a default constructor that sets shard_id = 0 and port = 0) while leaving node_address as-is, ensuring any default-constructed ShardInfo has deterministic zero values for those fields.
tests/raft_tests.cpp-21-26 (1)
21-26: ⚠️ Potential issue | 🟡 Minor — Avoid fixed ports in tests to reduce flakiness.
Lines 21 and 25 hardcode port 6000; this can fail non-deterministically when the port is in use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_tests.cpp` around lines 21 - 26, The test currently hardcodes TEST_PORT=6000 and uses it for config.cluster_port and constructing network::RpcServer, causing flakes when the port is occupied; change the test to allocate an ephemeral port instead (e.g., bind the server socket to port 0 or call a small helper that finds a free port), retrieve the assigned port, set config.cluster_port to that assigned port, and then construct cluster::ClusterManager and network::RpcServer with the dynamically assigned port (update references to TEST_PORT, config.cluster_port, cm and rpc accordingly).
tests/raft_simulation_tests.cpp-24-24 (1)
24-24: ⚠️ Potential issue | 🟡 Minor — Use dynamic test ports to avoid collisions.
Lines 24 and 44 hardcode ports (7000, 7001), which can cause sporadic failures in CI or parallel runs.
Also applies to: 44-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` at line 24, Tests hardcode fixed ports when constructing the RPC servers (network::RpcServer rpc(7000) and the similar instantiation at the other site), which causes port collisions in parallel/CI runs; change the tests to allocate dynamic ports by creating the RpcServer with port 0 (or use a helper that finds a free port) so the OS assigns an ephemeral port, then query the server for its actual bound port and use that value when composing peer addresses for other nodes; update both occurrences that pass literal 7000/7001 and ensure any follow-up setup (peer lists, client connections) uses the retrieved bound port from network::RpcServer instead of the hardcoded constants.
tests/distributed_tests.cpp-43-46 (1)
43-46: ⚠️ Potential issue | 🟡 Minor — Guard parser output before dereference.
Line 45 dereferences `stmt` without asserting parse success first; a parse failure becomes a crash instead of a clear test failure.
Proposed fix
auto stmt = parser.parse_statement(); + ASSERT_NE(stmt, nullptr); auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 43 - 46, The test currently dereferences the result of parser.parse_statement() (stmt) without checking it; change the test to assert the parse succeeded before using *stmt (e.g., use ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr) immediately after auto stmt = parser.parse_statement()) so the test fails cleanly on parse errors rather than crashing, then call exec.execute(*stmt, "CREATE TABLE test (id INT)") as before.
src/main.cpp-304-308 (1)
304-308: ⚠️ Potential issue | 🟡 Minor — 2PC commit/abort returns success even when the transaction is not found.
In both `TxnCommit` (lines 304-308) and `TxnAbort` (lines 332-336), if `get_transaction(args.txn_id)` returns null, `reply.success` is still set to `true`. This silently masks cases where the coordinator believes a commit/abort succeeded, but the data node had no record of the transaction.
For 2PC correctness, consider returning an error or at least logging a warning when the transaction is not found.
Proposed fix to handle missing transactions
try { auto txn = transaction_manager.get_transaction(args.txn_id); if (txn) { transaction_manager.commit(txn); + reply.success = true; + } else { + // Transaction not found - may have already been cleaned up + reply.success = true; // Idempotent: treat as already committed + // Consider logging: std::cerr << "TxnCommit: txn not found\n"; } - reply.success = true; } catch (const std::exception& e) {Also applies to: 332-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 304 - 308, In TxnCommit and TxnAbort, when transaction_manager.get_transaction(args.txn_id) returns null you currently set reply.success = true; change this so a missing txn is treated as an error: check the returned txn pointer, and if null set reply.success = false (or set an error code/message) and log a warning including args.txn_id before returning; only call transaction_manager.commit(txn) or transaction_manager.abort(txn) and set reply.success = true when txn is non-null. Ensure the same handling and logging is applied in both TxnCommit and TxnAbort.
src/distributed/raft_node.cpp-40-55 (1)
40-55: ⚠️ Potential issue | 🟡 Minor
Race condition: handlers registered after thread starts.
The `raft_thread_` is started on line 42, but RPC handlers are registered on lines 45-54. If an RPC message arrives in this window, it won't be handled. Register handlers before starting the thread.
Proposed fix to register handlers first
```diff
 void RaftNode::start() {
-  running_ = true;
-  raft_thread_ = std::thread(&RaftNode::run_loop, this);
-
   // Register handlers
   rpc_server_.set_handler(
       network::RpcType::RequestVote,
       [this](const network::RpcHeader& h, const std::vector<uint8_t>& p, int fd) {
         handle_request_vote(h, p, fd);
       });
   rpc_server_.set_handler(
       network::RpcType::AppendEntries,
       [this](const network::RpcHeader& h, const std::vector<uint8_t>& p, int fd) {
         handle_append_entries(h, p, fd);
       });
+
+  running_ = true;
+  raft_thread_ = std::thread(&RaftNode::run_loop, this);
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 40 - 55, Move RPC handler registration in RaftNode::start so rpc_server_.set_handler calls for network::RpcType::RequestVote (invoking handle_request_vote) and network::RpcType::AppendEntries (invoking handle_append_entries) occur before starting raft_thread_ and setting running_. In other words, register both handlers first, then set running_ = true and create raft_thread_ with &RaftNode::run_loop so there is no window where incoming messages could arrive without handlers installed.
src/main.cpp-259-265 (1)
259-265: ⚠️ Potential issue | 🟡 Minor
Potential payload truncation if response exceeds 64KB.
The `payload_len` field in `RpcHeader` is `uint16_t`, limiting payloads to 65535 bytes. The cast on line 261 (`static_cast<uint16_t>(resp_p.size())`) will silently truncate larger payloads, corrupting the protocol framing. For query results with many rows, this limit could be exceeded. Consider either validating the size before sending or extending the protocol to support larger payloads.
Add bounds check
```diff
 auto resp_p = reply.serialize();
+if (resp_p.size() > UINT16_MAX) {
+  // Handle oversized payload - truncate rows or return error
+  reply.success = false;
+  reply.error_msg = "Response too large";
+  reply.rows.clear();
+  resp_p = reply.serialize();
+}
 cloudsql::network::RpcHeader resp_h;
 resp_h.type = cloudsql::network::RpcType::QueryResults;
 resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 259 - 265, The RpcHeader.payload_len is a uint16_t and casting resp_p.size() to uint16_t in the RpcHeader creation (RpcHeader resp_h; resp_h.payload_len = static_cast<uint16_t>(resp_p.size());) will truncate payloads >65535; fix by validating resp_p.size() before encoding (e.g., if resp_p.size() > std::numeric_limits<uint16_t>::max() then return an error or use an alternate code path), or implement framing that supports larger payloads (e.g., extend RpcHeader.payload_len to a 32-bit field or send a multi-part/chunked response) and update the send logic that follows (send(fd, h_buf, 8, 0); send(fd, resp_p.data(), resp_p.size(), 0);) accordingly so no silent truncation occurs.
tests/server_tests.cpp-101-117 (1)
101-117: ⚠️ Potential issue | 🟡 Minor
Handshake test silently passes if connection fails.
If `connect()` fails on line 101, the entire handshake validation is skipped and the test passes. Consider using `ASSERT_EQ` to fail the test if connection fails.
Proposed fix to assert connection success
```diff
-if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
+ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)
+    << "Failed to connect to server";
+{
   // Send startup packet
   const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
                                            htonl(196608)};
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 101 - 117, The test currently uses if (connect(sock, ... ) == 0) { ... } so a failed connect silently skips assertions; change this to assert the connection succeeded using the test framework (e.g. replace the if with ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)) so the test fails immediately on connection error and then keep the existing startup/send/recv assertions (references: connect, sock, STARTUP_PKT_LEN, buffer).
🧹 Nitpick comments (13)
src/common/config.cpp (1)
67-73: Prefer explicit failure for unknown `mode` values.
Line 72 silently falls back to `Standalone` for unrecognized tokens, which can mask bad configs and start the wrong topology. Consider returning `false` on unknown mode strings.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/common/config.cpp` around lines 67 - 73, The current assignment silently maps any unrecognized string to RunMode::Standalone; change the logic in the function that parses 'value' so it only accepts the explicit tokens "distributed"/"coordinator" (set mode = RunMode::Coordinator) and "data" (set mode = RunMode::Data), and treat all other inputs as an error: do not set mode = RunMode::Standalone, instead log or report the invalid token and return false (or propagate failure) so the caller can reject the configuration; reference variables/functions: value, mode, and enum RunMode (RunMode::Coordinator, RunMode::Data, RunMode::Standalone) when implementing this validation change.
include/distributed/shard_manager.hpp (1)
9-12: Add direct standard headers used by this header.
This file uses `std::optional`, `std::nullopt`, and `std::hash` directly; add `<optional>` and `<functional>` to avoid transitive-include fragility.
Also applies to: 38-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 9 - 12, The header uses std::optional, std::nullopt and std::hash directly but relies on transitive includes; add explicit `#include` <optional> and `#include` <functional> at the top of include/distributed/shard_manager.hpp (alongside the existing <cstdint>, <string>, <vector>) so types like std::optional/std::nullopt and std::hash are always available for symbols used in this file (e.g. any declarations/uses in the ShardManager-related code).
tests/distributed_tests.cpp (1)
37-39: Set distributed mode explicitly in `DDLRouting` test setup.
Set `config.mode = config::RunMode::Coordinator` so this test can't silently exercise a non-distributed default path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 37 - 39, In the DDLRouting test setup, explicitly set the run mode on the config object to prevent the test from defaulting to a non-distributed path: after creating config::Config config, assign config.mode = config::RunMode::Coordinator before constructing ClusterManager cm(&config) and DistributedExecutor exec(*catalog, cm); this ensures the test uses the distributed Coordinator mode when invoking ClusterManager and DistributedExecutor.
CMakeLists.txt (1)
126-129: Pin googletest to a tag/commit for reproducible builds.
Using `refs/heads/main.zip` makes builds non-reproducible and increases supply-chain risk.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CMakeLists.txt` around lines 126 - 129, The FetchContent_Declare usage for googletest currently points to an unstable branch URL (URL https://github.com/google/googletest/archive/refs/heads/main.zip) which makes builds non-reproducible; update the FetchContent_Declare for the googletest dependency (the googletest declaration/URL) to reference a specific tag or commit (e.g., use the archive URL for a release tag or switch to GIT_REPOSITORY + GIT_TAG with a fixed tag/commit SHA) so the same version is fetched every build and supply-chain risk is reduced.
tests/raft_simulation_tests.cpp (1)
32-37: Add a post-timeout assertion in `FollowerToCandidate`.
After waiting on lines 32-33, there is no assertion, so this test can pass without verifying election behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` around lines 32 - 37, After the sleep in the FollowerToCandidate test, add an assertion that the node transitioned out of Follower (i.e., is Candidate or Leader). Locate the FollowerToCandidate test in tests/raft_simulation_tests.cpp and, immediately after the std::this_thread::sleep_for(...) call, assert the node's state using the project's state accessor (e.g., node->GetState() or raft_node.GetState()) with the testing macro (EXPECT_TRUE or ASSERT_TRUE) to check GetState() == State::Candidate || GetState() == State::Leader so the test verifies election behavior.
tests/raft_tests.cpp (1)
18-29: `StateTransitions` currently validates only the initial state.
After setup, the test only checks `EXPECT_FALSE(node.is_leader())` and does not assert any transition behavior. Either add a transition assertion or rename the test to reflect initial-state validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_tests.cpp` around lines 18 - 29, The test named TEST(RaftTests, StateTransitions) only verifies the initial state via RaftNode::is_leader() and does not exercise any state transitions; either rename the test to indicate it checks the initial state (e.g., "InitialState") or implement transition assertions: initialize the RaftNode ("node1", cm, rpc), trigger a transition path (for example start the node, advance timers or simulate an election timeout / append entries from peers using ClusterManager cm or network::RpcServer rpc hooks), then assert the expected state change via node.is_leader() (or other state-query methods) and any relevant side-effects; update the test name if you choose the rename approach so the test intent matches its claims.
tests/distributed_txn_tests.cpp (1)
105-106: Replace fixed sleeps with deterministic synchronization in assertions.
Using fixed `50ms` sleeps makes these tests timing-sensitive and flaky under slower CI hosts. Prefer waiting on a condition/poll-with-timeout tied to the atomic counters (or assert immediately if `execute()` is expected to block through phase 2).
Also applies to: 212-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 105 - 106, Replace the fixed std::this_thread::sleep_for(50ms) and EXPECT_EQ(commit_count.load(), 2) with deterministic polling or condition-wait: repeatedly check commit_count.load() (or wait on a std::condition_variable signaled by the code that increments commit_count) until it reaches the expected value or a reasonable timeout elapses, and then ASSERT/EXPECT that the count reached 2 (or fail the test on timeout); apply the same change for the other occurrence around execute() so the test no longer relies on a fixed 50ms sleep and instead deterministically waits for commit_count to update.
src/catalog/catalog.cpp (1)
95-100: Avoid persisting hardcoded shard endpoints in table metadata.
Writing `127.0.0.1:6432` for every new table bakes incorrect topology into catalog state outside single-node setups. This should come from actual cluster/shard assignment logic (or remain unset until assigned).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/catalog/catalog.cpp` around lines 95 - 100, The code is persisting a hardcoded shard endpoint by populating ShardInfo with shard.node_address = "127.0.0.1" and shard.port = 6432 and always pushing it into table->shards; remove this baked-in topology by not setting shard.node_address/shard.port to production values (leave them empty/zero) and avoid pushing the shard into table->shards until a real cluster assignment occurs. Locate the ShardInfo construction (symbols: ShardInfo, shard.node_address, shard.port, table->shards.push_back) in catalog.cpp and either omit adding the shard here or call the proper cluster/shard assignment routine (e.g., the cluster manager or AssignShard method) to populate and persist endpoint details only after a real assignment.
src/distributed/raft_node.cpp (2)
273-277: Use the `mutable` keyword instead of `const_cast`.
The `const_cast` on line 275 is a code smell indicating the design is inconsistent. Since `get_random_timeout()` is marked `const` but needs to modify `rng_`, the cleaner solution is to declare `rng_` as `mutable` in the header.
Proposed change in header (include/distributed/raft_node.hpp)
```diff
- std::mt19937 rng_;
+ mutable std::mt19937 rng_;
```
Then in the implementation:
```diff
 std::chrono::milliseconds RaftNode::get_random_timeout() const {
   std::uniform_int_distribution<int> dist(TIMEOUT_MIN_MS, TIMEOUT_MAX_MS);
-  auto& mutable_rng = const_cast<std::mt19937&>(rng_);
-  return std::chrono::milliseconds(dist(mutable_rng));
+  return std::chrono::milliseconds(dist(rng_));
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 273 - 277, The method RaftNode::get_random_timeout() is const but mutates rng_, so remove the const_cast by declaring the RNG member rng_ as mutable in the RaftNode class (in the header) and then simplify get_random_timeout() to use rng_ directly; update the declaration of rng_ (std::mt19937 rng_) to mutable std::mt19937 rng_ and remove the const_cast in get_random_timeout().
279-280: Missing state persistence breaks Raft safety guarantees.
The empty `persist_state()` and `load_state()` implementations mean that after a restart, a node could vote twice in the same term or lose track of committed entries. This violates Raft's safety properties.
While acceptable for a POC, this should be tracked as a high-priority TODO.
Would you like me to open an issue to track implementing persistent state storage for Raft durability?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/raft_node.cpp` around lines 279 - 280, The RaftNode::persist_state() and RaftNode::load_state() stubs must be implemented to persist and recover Raft durable state: serialize and atomically write currentTerm, votedFor, and the stable prefix of the log (or at least the metadata needed for safety) in persist_state(), and read/validate/restore those values in load_state() at startup; ensure writes are atomic (write-to-temp+rename or use fsync) and call persist_state() whenever currentTerm or votedFor changes or entries are committed/installed (e.g., after Election, Vote handling, and AppendEntries apply paths), and have load_state() validate format/version/CRC and restore RaftNode::currentTerm, RaftNode::votedFor and log structures so the node cannot vote twice or lose committed entries.
src/main.cpp (2)
269-294: TxnPrepare handler ignores the transaction ID.
The `args` variable is deserialized but `args.txn_id` is never used. The comment on lines 277-278 acknowledges this is a simplified implementation. Consider adding a TODO comment referencing the transaction lookup requirement, or using `[[maybe_unused]]` to suppress warnings.
Mark unused variable explicitly
```diff
 (void)h;
-auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
+[[maybe_unused]] auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 269 - 294, The deserialized args (cloudsql::network::TxnOperationArgs args) in the TxnPrepare handler lambda is never used; either mark it explicitly unused or add a TODO to avoid confusion: inside the lambda for RpcType::TxnPrepare (the handler that currently calls log_manager->flush), annotate the unused variable with [[maybe_unused]] or replace it with [[maybe_unused]] auto args = ... and/or add a clear TODO comment referencing that the implementation should look up the transaction by args.txn_id and flush that txn's WAL (e.g., "TODO: use args.txn_id to find txn and flush its WAL"), keeping the rest of the handler logic unchanged.
228-266: Lambda captures references to stack-local objects.
The RPC handler lambdas capture `catalog`, `bpm`, `lock_manager`, `transaction_manager`, and `log_manager` by reference. These objects are declared earlier in `main()` and remain valid for the lifetime of the program, so the captures are safe in this context.
However, the `send()` calls on lines 264-265 don't check return values. If the client disconnects mid-response, partial writes could occur silently.
Consider checking send() return values
```diff
-send(fd, h_buf, 8, 0);
-send(fd, resp_p.data(), resp_p.size(), 0);
+if (send(fd, h_buf, 8, 0) < 0 ||
+    send(fd, resp_p.data(), resp_p.size(), 0) < 0) {
+  // Log or handle send failure
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 228 - 266, The RPC handler registered via rpc_server->set_handler writes the header and payload with send(fd, h_buf, 8, 0) and send(fd, resp_p.data(), resp_p.size(), 0) but ignores return values and thus doesn't handle partial writes or client disconnects; update the handler to check and handle send() results for both the encoded header (resp_h.encode -> h_buf) and the payload (resp_p.data(), resp_p.size()) by looping until all bytes are written (or an error/EINTR/EAGAIN occurs), logging failures via log_manager and closing or cleaning up the socket (fd) on unrecoverable errors; ensure you reference and protect the same symbols used here (rpc_server->set_handler lambda, resp_h.encode, h_buf, resp_p, send) when implementing the write-loop and error handling.
tests/server_tests.cpp (1)
38-41: Consider test isolation for shared data directory.
All tests use `"./test_data"` as the data directory. While these tests focus on server lifecycle rather than storage, shared state could cause flakiness. Consider using unique directories per test or a test fixture with setup/teardown.
Also applies to: 51-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 38 - 41, The tests instantiate StorageManager with a hardcoded "./test_data" which can create shared-state flakiness; change tests using StorageManager, storage::BufferPoolManager, and Server to use a unique temporary directory per test (e.g., construct a unique path via std::filesystem::temp_directory_path() + test-specific suffix or timestamp) and ensure cleanup in a test fixture SetUp()/TearDown() (or RAII helper) that creates the dir before creating StorageManager and removes it after the test completes to isolate state across tests.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- CMakeLists.txt
- include/catalog/catalog.hpp
- include/common/cluster_manager.hpp
- include/common/config.hpp
- include/distributed/distributed_executor.hpp
- include/distributed/raft_node.hpp
- include/distributed/raft_types.hpp
- include/distributed/shard_manager.hpp
- include/network/rpc_client.hpp
- include/network/rpc_message.hpp
- include/network/rpc_server.hpp
- include/network/server.hpp
- include/recovery/log_record.hpp
- include/transaction/transaction.hpp
- include/transaction/transaction_manager.hpp
- src/catalog/catalog.cpp
- src/common/config.cpp
- src/distributed/distributed_executor.cpp
- src/distributed/raft_node.cpp
- src/main.cpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- src/transaction/transaction_manager.cpp
- tests/distributed_tests.cpp
- tests/distributed_txn_tests.cpp
- tests/raft_simulation_tests.cpp
- tests/raft_tests.cpp
- tests/server_tests.cpp
```cmake
# Configure Sanitizers
if (NOT USE_SANITIZER STREQUAL "none")
  if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
    set(CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
```
🧩 Analysis chain
🏁 Script executed:
find . -name "CMakeLists.txt" -type f
Repository: poyrazK/cloudSQL
Length of output: 76
🏁 Script executed:
cat -n CMakeLists.txt
Repository: poyrazK/cloudSQL
Length of output: 7632
Remove the invalid set(CMAKE_CXX_COMPILER_ID MATCHES ...) assignment on line 23.
Line 23 uses invalid CMake syntax—the set() command does not accept the MATCHES operator. This corrupts CMAKE_CXX_COMPILER_ID and breaks downstream compiler checks on lines 39 and 87. The line is also redundant since it's already within the if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU") block from line 22.
Proposed fix
if (CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
- set(CMAKE_CXX_COMPILER_ID MATCHES "Clang|GNU")
string(REPLACE "," ";" SAN_LIST ${USE_SANITIZER})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CMakeLists.txt` at line 23, Remove the invalid assignment that uses the
MATCHES operator inside set(): delete the line "set(CMAKE_CXX_COMPILER_ID
MATCHES "Clang|GNU")" so CMAKE_CXX_COMPILER_ID is not overwritten; the correct
approach is to keep the surrounding "if (CMAKE_CXX_COMPILER_ID MATCHES
"Clang|GNU")" condition and remove the redundant set(...) call that breaks
downstream checks referencing CMAKE_CXX_COMPILER_ID (used later around the
existing checks on lines near 39 and 87).
```cpp
constexpr size_t BASE_SIZE = 24;
out.resize(BASE_SIZE + candidate_id.size());
std::memcpy(out.data(), &term, sizeof(term_t));
const uint64_t id_len = candidate_id.size();
std::memcpy(out.data() + 8, &id_len, 8);
std::memcpy(out.data() + 16, candidate_id.data(), id_len);
std::memcpy(out.data() + 16 + id_len, &last_log_index, sizeof(index_t));
std::memcpy(out.data() + 24 + id_len, &last_log_term, sizeof(term_t));
return out;
```
🧩 Analysis chain
🏁 Script executed:
# First, let me read the file to see the actual code at the specified lines
cat -n include/distributed/raft_types.hpp | head -100
Repository: poyrazK/cloudSQL
Length of output: 3123
🏁 Script executed:
# Search for the RequestVoteArgs class and serialize method
rg -A 20 "struct RequestVoteArgs" include/distributed/raft_types.hpp
Repository: poyrazK/cloudSQL
Length of output: 804
🏁 Script executed:
# Search for type definitions of index_t and term_t
rg "using (index_t|term_t)" include/distributed/raft_types.hpp
Repository: poyrazK/cloudSQL
Length of output: 110
RequestVoteArgs::serialize() under-allocates and causes out-of-bounds write.
Line 48 allocates 24 + id_len bytes, but line 54 writes 8 additional bytes at offset 24 + id_len, exceeding the buffer by 8 bytes. This is a critical memory corruption bug.
🔧 Proposed fix
[[nodiscard]] std::vector<uint8_t> serialize() const {
std::vector<uint8_t> out;
- constexpr size_t BASE_SIZE = 24;
- out.resize(BASE_SIZE + candidate_id.size());
- std::memcpy(out.data(), &term, sizeof(term_t));
- const uint64_t id_len = candidate_id.size();
- std::memcpy(out.data() + 8, &id_len, 8);
- std::memcpy(out.data() + 16, candidate_id.data(), id_len);
- std::memcpy(out.data() + 16 + id_len, &last_log_index, sizeof(index_t));
- std::memcpy(out.data() + 24 + id_len, &last_log_term, sizeof(term_t));
+ const uint64_t id_len = static_cast<uint64_t>(candidate_id.size());
+ constexpr size_t FIXED_SIZE =
+ sizeof(term_t) + sizeof(uint64_t) + sizeof(index_t) + sizeof(term_t);
+ out.resize(FIXED_SIZE + id_len);
+
+ size_t offset = 0;
+ std::memcpy(out.data() + offset, &term, sizeof(term_t));
+ offset += sizeof(term_t);
+ std::memcpy(out.data() + offset, &id_len, sizeof(uint64_t));
+ offset += sizeof(uint64_t);
+ std::memcpy(out.data() + offset, candidate_id.data(), id_len);
+ offset += id_len;
+ std::memcpy(out.data() + offset, &last_log_index, sizeof(index_t));
+ offset += sizeof(index_t);
+ std::memcpy(out.data() + offset, &last_log_term, sizeof(term_t));
return out;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
const uint64_t id_len = static_cast<uint64_t>(candidate_id.size());
constexpr size_t FIXED_SIZE =
    sizeof(term_t) + sizeof(uint64_t) + sizeof(index_t) + sizeof(term_t);
out.resize(FIXED_SIZE + id_len);

size_t offset = 0;
std::memcpy(out.data() + offset, &term, sizeof(term_t));
offset += sizeof(term_t);
std::memcpy(out.data() + offset, &id_len, sizeof(uint64_t));
offset += sizeof(uint64_t);
std::memcpy(out.data() + offset, candidate_id.data(), id_len);
offset += id_len;
std::memcpy(out.data() + offset, &last_log_index, sizeof(index_t));
offset += sizeof(index_t);
std::memcpy(out.data() + offset, &last_log_term, sizeof(term_t));
return out;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/distributed/raft_types.hpp` around lines 47 - 55, The buffer is
under-allocated in RequestVoteArgs::serialize(): BASE_SIZE (24) +
candidate_id.size() is too small because additional writes for last_log_index
and last_log_term occur after that, causing an OOB write. Fix by computing
total_size = /* header sizes */ + candidate_id.size() + sizeof(index_t) +
sizeof(term_t) (or simply resize to BASE_SIZE + candidate_id.size() +
sizeof(index_t) + sizeof(term_t)), update out.resize(total_size), and replace
hardcoded offset constants (8,16,24) with expressions using sizeof(term_t) and
candidate_id.size() so memcpy targets use correct offsets when writing term,
id_len, candidate_id, last_log_index, and last_log_term in
RequestVoteArgs::serialize().
```cpp
static uint32_t compute_shard(const common::Value& pk_value, uint32_t num_shards) {
    if (num_shards == 0) return 0;

    // Simple hash for demo purposes
    std::string s = pk_value.to_string();
    size_t hash = std::hash<std::string>{}(s);
    return static_cast<uint32_t>(hash % num_shards);
}
```
🧩 Analysis chain
🏁 Script executed:
cat -n include/distributed/shard_manager.hpp
Repository: poyrazK/cloudSQL
Length of output: 1696
🏁 Script executed:
# Check if <functional> is needed for std::hash
rg "std::hash" --type cpp
# Search for shard routing usage
rg "compute_shard" --type cpp
# Check if there are other hash implementations in the codebase
rg "FNV|fnv|deterministic.*hash|stable.*hash" --type cpp -i
# Verify if ShardInfo and TableInfo are defined
rg "struct ShardInfo|class ShardInfo" --type hpp
# Check for num_shards == 0 usage patterns
rg "num_shards.*0|0.*shard" --type cpp -A 2 -B 2
Repository: poyrazK/cloudSQL
Length of output: 1391
🏁 Script executed:
# Check the distributed_tests.cpp to see what the hash stability test expects
cat -n tests/distributed_tests.cpp | head -100
# Check if <functional> is included
rg "#include.*functional" include/distributed/shard_manager.hpp
# Look at Value::to_string() implementation
rg -A 10 "to_string()" include/common/value.hpp
# Check common/value.hpp hash implementation context
head -50 include/common/value.hpp
Repository: poyrazK/cloudSQL
Length of output: 4315
🏁 Script executed:
# Check if <functional> is included transitively
rg "#include" include/distributed/shard_manager.hpp
# Check what's in common/value.hpp that might pull in <functional>
rg "#include" include/common/value.hpp
# Check the actual distributed executor usage context
cat -n src/distributed/distributed_executor.cpp | grep -A 5 -B 5 "compute_shard"
Repository: poyrazK/cloudSQL
Length of output: 871
🏁 Script executed:
# Check if there's any runtime validation or assertions for num_shards
rg "num_shards|data_nodes.size()" src/distributed/distributed_executor.cpp -B 3 -A 3
# Verify if the Value class has its own stable hash implementation
grep -A 20 "struct Hash" include/common/value.hpp
# Check if the code is actually in production or marked as demo/WIP
rg "TODO|FIXME|demo|WIP" include/distributed/shard_manager.hpp
Repository: poyrazK/cloudSQL
Length of output: 1131
Replace std::hash<std::string> with deterministic hashing for distributed consistency.
Line 31 uses std::hash<std::string>, which is implementation-defined and produces different hashes across compiler versions, platforms, and standard library implementations. In a distributed system, the same key must consistently hash to the same shard regardless of which node performs the computation. The current approach can silently route identical keys to different shards across nodes.
The "demo purposes" comment (line 29) confirms this is intentionally temporary. Replace with a stable hash algorithm like FNV-1a (as proposed below) or another deterministic variant that produces consistent output across all builds and deployments.
Proposed fix
- // Simple hash for demo purposes
- std::string s = pk_value.to_string();
- size_t hash = std::hash<std::string>{}(s);
- return static_cast<uint32_t>(hash % num_shards);
+ const std::string s = pk_value.to_string();
+ uint64_t hash = 1469598103934665603ULL; // FNV-1a 64-bit
+ for (unsigned char c : s) {
+ hash ^= static_cast<uint64_t>(c);
+ hash *= 1099511628211ULL;
+ }
+ return static_cast<uint32_t>(hash % num_shards);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
static uint32_t compute_shard(const common::Value& pk_value, uint32_t num_shards) {
    if (num_shards == 0) return 0;
    const std::string s = pk_value.to_string();
    uint64_t hash = 1469598103934665603ULL;  // FNV-1a 64-bit
    for (unsigned char c : s) {
        hash ^= static_cast<uint64_t>(c);
        hash *= 1099511628211ULL;
    }
    return static_cast<uint32_t>(hash % num_shards);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/distributed/shard_manager.hpp` around lines 26 - 33, compute_shard
uses std::hash which is non-deterministic across platforms; replace it with a
deterministic algorithm (e.g., FNV-1a) so shard assignment is consistent across
nodes. In compute_shard(const common::Value& pk_value, uint32_t num_shards)
compute a stable 64-bit FNV-1a hash over the bytes of pk_value.to_string() (or
its canonical byte representation), then return static_cast<uint32_t>(fnv_hash %
num_shards); preserve the num_shards==0 early return and ensure the hashing uses
fixed-width types (uint64_t) and byte-wise processing to guarantee
cross-platform stability.
```cpp
static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
    QueryResultsReply reply;
    if (in.empty()) return reply;

    reply.success = in[0] != 0;

    uint32_t err_len;
    std::memcpy(&err_len, in.data() + 1, 4);
    if (in.size() >= 5 + err_len) {
        reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
    }

    uint32_t row_count;
    if (in.size() >= 9 + err_len) {
        std::memcpy(&row_count, in.data() + 5 + err_len, 4);
        reply.rows.resize(row_count);  // Placeholders
    }

    return reply;
}
```
Out-of-bounds read in QueryResultsReply::deserialize.
Line 126 reads 4 bytes at offset 1 (in.data() + 1) but only checks in.empty() beforehand. If in.size() is between 1 and 4, this causes undefined behavior. The bounds check on line 127 happens after the read.
Proposed fix to add bounds check before read

```diff
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
   QueryResultsReply reply;
-  if (in.empty()) return reply;
+  if (in.size() < 5) return reply; // Need at least: success(1) + err_len(4)
   reply.success = in[0] != 0;
   uint32_t err_len;
   std::memcpy(&err_len, in.data() + 1, 4);
+  err_len = ntohl(err_len); // Convert from network byte order
   if (in.size() >= 5 + err_len) {
     reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
   }
   uint32_t row_count;
-  if (in.size() >= 9 + err_len) {
+  if (in.size() >= 9 + err_len) { // success(1) + err_len(4) + err_msg + row_count(4)
     std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+    row_count = ntohl(row_count); // Convert from network byte order
     reply.rows.resize(row_count); // Placeholders
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 119 - 138, The deserialize
function QueryResultsReply::deserialize performs unguarded reads (std::memcpy)
for err_len and row_count causing potential OOB; before reading the 4-byte
err_len at in.data()+1 ensure in.size() >= 1 + 4, and before reading row_count
ensure in.size() >= 5 + err_len + 4; if the buffer is too small, return a
default/partial reply or set reply.success=false and reply.error_msg to an
explanatory message; adjust the checks around err_len and row_count (and any
subsequent reads) so every memcpy or pointer-based string construction is
preceded by a size check using in.size() and the computed err_len/row_count
values.
```cpp
std::thread accept_thread_;
std::vector<std::thread> worker_threads_;
std::unordered_map<RpcType, RpcHandler> handlers_;
std::mutex handlers_mutex_;
```
Add synchronization/state fields for safe worker lifecycle management.
Given concurrent accept_loop() and stop() behavior, the class needs dedicated synchronization (e.g., thread/client mutex and tracked client fds) to avoid races/hangs during shutdown.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_server.hpp` around lines 60 - 63, The class lacks
lifecycle synchronization for concurrent accept_loop() and stop(), leading to
races/hangs; add explicit state and synchronization fields (e.g., an
atomic<bool> running or stopping flag, a std::mutex clients_mutex, a
std::unordered_set<int> client_fds to track active clients, and a
std::condition_variable shutdown_cv) and use them in accept_loop(), worker
threads, and stop() to: (1) check/flip the running flag to prevent new accepts
after stop(), (2) protect modifications to client_fds with clients_mutex, (3)
notify/wait on shutdown_cv to join worker_threads_ cleanly, and (4) guard
handlers_ using handlers_mutex_ where handlers are modified or invoked; update
accept_thread_, worker_threads_, accept_loop(), and stop() to consult these new
fields to ensure orderly shutdown.
```cpp
prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
    network::RpcClient client(node.address, node.cluster_port);
    if (client.connect()) {
        std::vector<uint8_t> resp_payload;
        if (client.call(network::RpcType::TxnPrepare, payload, resp_payload)) {
            auto reply = network::QueryResultsReply::deserialize(resp_payload);
            if (reply.success) return std::make_pair(true, std::string(""));
            return std::make_pair(false, "[" + node.id + "] Prepare failed: " + reply.error_msg);
        }
        return std::make_pair(false, "[" + node.id + "] RPC failed during prepare");
    }
    return std::make_pair(false, "[" + node.id + "] Connection failed during prepare");
}));
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check the file structure and read the relevant sections
wc -l src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 104

🏁 Script executed:

```bash
#!/bin/bash
# Read lines 50-130 to see the loop context and all three locations mentioned
sed -n '50,130p' src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 3412

🏁 Script executed:

```bash
#!/bin/bash
# Verify the exact pattern: std::async with [&node] captures
rg -n 'std::async.*\[&node' src/distributed/distributed_executor.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 352
Async tasks capture loop variable node by reference, creating dangling references.
In the TransactionCommit (lines 60, 90), and TransactionRollback (line 115) handlers, the async lambdas capture node by reference within the for (const auto& node : data_nodes) loop. Since the futures are stored and awaited after the loop completes, the reference becomes invalid when the loop variable goes out of scope.
✅ Capture `node` by value

```diff
-    prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+    prepare_futures.push_back(std::async(std::launch::async, [node, payload]() {
@@
-    phase2_futures.push_back(std::async(std::launch::async, [&node, payload, phase2_type]() {
+    phase2_futures.push_back(std::async(std::launch::async, [node, payload, phase2_type]() {
@@
-    rollback_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+    rollback_futures.push_back(std::async(std::launch::async, [node, payload]() {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/distributed/distributed_executor.cpp` around lines 60 - 72, The async
lambda captures the loop variable node by reference which leads to dangling
references when futures outlive the loop; update the capture in the
prepare_futures (and analogous TransactionCommit and TransactionRollback) async
lambdas to capture node by value (e.g., capture node instead of &node or use
node = node) so each future holds its own copy of the node and avoid
use-after-scope issues when awaiting the stored futures.
```cpp
bool RpcClient::call(RpcType type, const std::vector<uint8_t>& payload,
                     std::vector<uint8_t>& response_out) {
    if (!send_only(type, payload)) {
        return false;
    }

    char header_buf[8];
    if (recv(fd_, header_buf, 8, 0) <= 0) {
        return false;
    }

    const RpcHeader resp_header = RpcHeader::decode(header_buf);
    response_out.resize(resp_header.payload_len);
    if (resp_header.payload_len > 0) {
        static_cast<void>(recv(fd_, response_out.data(), resp_header.payload_len, 0));
    }

    return true;
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
head -100 src/network/rpc_client.cpp | cat -n
```

Repository: poyrazK/cloudSQL
Length of output: 3097

🏁 Script executed:

```bash
find . -type f -name "*.h" | xargs grep -l "class RpcClient" | head -1
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
head -150 src/network/rpc_client.cpp | cat -n
```

Repository: poyrazK/cloudSQL
Length of output: 3138
Add loops to handle TCP partial sends/receives and protect recv calls with mutex lock.
The recv() calls on lines 63 and 70 don't hold the mutex and can return fewer bytes than requested. This causes two critical issues:
- Partial transfers not handled: TCP can return fewer bytes than requested; the code treats any non-negative return as success, causing data loss or truncation.
- Interleaved responses: Multiple threads can call `call()` concurrently. After `send_only()` releases the lock (line 58), another thread can send its request before the first thread's `recv()` (line 63) completes, causing response mixing.
The same issue affects send() on lines 89 and 93—partial sends are not retried.
Wrap recv/send calls in loops that continue until all bytes are transferred or an error occurs. Extend the mutex lock scope in call() to cover the entire recv sequence (lines 63-70).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` around lines 56 - 74, The RpcClient::call
implementation must handle partial TCP transfers and prevent interleaved
responses by extending the mutex lock around the entire receive sequence and
retrying partial recv/write until all bytes are transferred or an error occurs:
inside RpcClient::call, after send_only(...) returns, acquire the connection
mutex and perform a looped recv on fd_ to read exactly 8 header bytes (feeding
into header_buf) and then decode with RpcHeader::decode, resize response_out,
and looped recv until resp_header.payload_len bytes are fully read into
response_out; similarly, update the send(...) implementation (and any send-only
helper used by send_only) to loop until the full buffer is written (handling
partial returns and errors) so that both send and recv always transfer the
requested byte counts and the call() mutex covers the entire receive phase to
avoid interleaving between threads.
```cpp
const std::scoped_lock<std::mutex> lock(mutex_);
if (fd_ < 0 && !connect()) {
    return false;
```
🧩 Analysis chain
🏁 Script executed:

```bash
# First, let's look at the send_only() and connect() implementations
rg -A 20 "bool RpcClient::send_only" src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 627

🏁 Script executed:

```bash
# Also check the connect() method implementation
rg -A 20 "bool RpcClient::connect()" src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 626

🏁 Script executed:

```bash
# Check the mutex_ declaration and type
rg "mutex_" src/network/rpc_client.cpp | head -20
```

Repository: poyrazK/cloudSQL
Length of output: 218

🏁 Script executed:

```bash
# Let's also check the header file to see class definition
fd "rpc_client.h" src/network/
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Find and read the rpc_client.h header
rg -l "rpc_client" src/network/ | grep "\.h"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Read the class definition and member variables
rg -B 5 -A 10 "std::mutex.*mutex_|mutex_.*std::mutex" src/network/rpc_client.h
```

Repository: poyrazK/cloudSQL
Length of output: 124

🏁 Script executed:

```bash
# Search for the header file more broadly
fd -e h "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Also check the actual directory structure of src/network
ls -la src/network/
```

Repository: poyrazK/cloudSQL
Length of output: 369

🏁 Script executed:

```bash
# Check if there's a header file somewhere else in the repo
fd -e h "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 42

🏁 Script executed:

```bash
# Look at the beginning of rpc_client.cpp for class definition or includes
head -50 src/network/rpc_client.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 1180

🏁 Script executed:

```bash
# Find the rpc_client.hpp header file
fd -e hpp "rpc_client"
```

Repository: poyrazK/cloudSQL
Length of output: 90

🏁 Script executed:

```bash
# Read the header file to confirm mutex_ type
cat -n include/network/rpc_client.hpp
```

Repository: poyrazK/cloudSQL
Length of output: 1464
Fix self-deadlock in send_only() when reconnecting.
When fd_ is negative, line 78 calls connect(), which attempts to lock mutex_ (line 43 in header: std::mutex mutex_). However, the same mutex is already held by std::scoped_lock at line 77. Since std::mutex is non-recursive, this causes a deadlock.
The proposed fix using std::unique_lock with explicit unlock/lock around the connect() call resolves this while maintaining thread safety (the fd_ check in connect() prevents race conditions).
Proposed fix

```diff
-bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
-  const std::scoped_lock<std::mutex> lock(mutex_);
-  if (fd_ < 0 && !connect()) {
-    return false;
-  }
+bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
+  std::unique_lock<std::mutex> lock(mutex_);
+  if (fd_ < 0) {
+    lock.unlock();
+    if (!connect()) {
+      return false;
+    }
+    lock.lock();
+  }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (fd_ < 0) {
        lock.unlock();
        if (!connect()) {
            return false;
        }
        lock.lock();
    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` around lines 77 - 79, The send_only() function
holds mutex_ via std::scoped_lock and then calls connect(), but connect() also
locks mutex_, causing a self-deadlock; change send_only() to use
std::unique_lock<std::mutex> lock(mutex_) so you can temporarily unlock before
calling connect() and re-lock afterward (i.e., check fd_, if fd_ < 0 then
lock.unlock(); if (!connect()) return false; lock.lock();), preserving the fd_
checks and thread-safety around fd_ and using the existing mutex_ and connect()
implementation.
src/network/rpc_server.cpp
Outdated
```cpp
const ssize_t n = recv(client_fd, header_buf, 8, 0);
if (n <= 0) {
    break;
}

const RpcHeader header = RpcHeader::decode(header_buf);
std::vector<uint8_t> payload(header.payload_len);
if (header.payload_len > 0) {
    static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, verify the file exists and check the specific lines mentioned
echo "=== Checking file existence and line count ==="
wc -l src/network/rpc_server.cpp
echo -e "\n=== Reading lines 90-110 for context ==="
sed -n '90,110p' src/network/rpc_server.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 859

🏁 Script executed:

```bash
#!/bin/bash
# Run the provided verification script to find recv() patterns
rg -nP 'recv\(client_fd,\s*header_buf,\s*8|recv\(client_fd,\s*payload\.data\(\),\s*header\.payload_len' src/network/rpc_server.cpp
```

Repository: poyrazK/cloudSQL
Length of output: 214
Fix incomplete recv() calls that can corrupt RPC framing.
TCP reads are not guaranteed to return the requested number of bytes. Line 95 only checks if (n <= 0) but does not verify that exactly 8 bytes were received; a partial header will be decoded incorrectly. Line 103 ignores the payload recv() return value entirely, silently accepting incomplete reads and causing protocol desynchronization on the next frame.
Implement a loop (e.g., recv_all()) to guarantee complete reads for both the 8-byte header and full payload:
Suggested fix pattern

```diff
+namespace {
+ssize_t recv_all(int fd, void* buf, size_t len) {
+  size_t total = 0;
+  auto* out = static_cast<char*>(buf);
+  while (total < len) {
+    const ssize_t n = recv(fd, out + total, len - total, 0);
+    if (n <= 0) return n;
+    total += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(total);
+}
+} // namespace
@@
-  const ssize_t n = recv(client_fd, header_buf, 8, 0);
-  if (n <= 0) {
+  const ssize_t n = recv_all(client_fd, header_buf, sizeof(header_buf));
+  if (n != static_cast<ssize_t>(sizeof(header_buf))) {
     break;
   }
@@
   std::vector<uint8_t> payload(header.payload_len);
   if (header.payload_len > 0) {
-    static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
+    if (recv_all(client_fd, payload.data(), header.payload_len) !=
+        static_cast<ssize_t>(header.payload_len)) {
+      break;
+    }
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_server.cpp` around lines 95 - 104, The current recv usage can
return partial reads and corrupt RPC framing; implement a blocking helper like
recv_all(int fd, void* buf, size_t len) that loops until len bytes are read or
an error/EOF occurs, and use it to read the 8-byte header into header_buf
(before calling RpcHeader::decode) and to read header.payload_len bytes into
payload.data(); ensure you check the return of recv_all and handle short/failed
reads by closing the connection or breaking the loop (use the same client_fd,
header_buf, payload, RpcHeader::decode symbols to locate where to replace the
single recv calls).
```cpp
void TransactionManager::commit(Transaction* txn) {
    if (txn->get_state() == TransactionState::COMMITTED) {
        return;
    }
```
Prevent invalid terminal-state transitions in commit() and abort().
Current guards allow committing an already-aborted transaction and aborting an already-committed transaction, which can corrupt transaction state/history.
✅ Suggested guard hardening

```diff
 void TransactionManager::commit(Transaction* txn) {
+  if (txn == nullptr) {
+    return;
+  }
   if (txn->get_state() == TransactionState::COMMITTED) {
     return;
   }
+  if (txn->get_state() == TransactionState::ABORTED) {
+    return;
+  }
@@
 void TransactionManager::abort(Transaction* txn) {
-  if (txn->get_state() == TransactionState::ABORTED) {
+  if (txn == nullptr) {
+    return;
+  }
+  if (txn->get_state() == TransactionState::ABORTED ||
+      txn->get_state() == TransactionState::COMMITTED) {
     return;
   }
-  /* Undo all changes if not already committed */
-  if (txn->get_state() != TransactionState::COMMITTED) {
-    undo_transaction(txn);
-  }
+  /* Undo all changes */
+  undo_transaction(txn);
```

Also applies to: 115-117, 119-142
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/transaction/transaction_manager.cpp` around lines 75 - 78, The commit()
and abort() methods must reject transitions from one terminal state to the
other: in TransactionManager::commit(Transaction* txn) add a guard that if
txn->get_state() == TransactionState::ABORTED then do not proceed (log/return or
throw) so an aborted txn cannot be committed; likewise in
TransactionManager::abort(Transaction* txn) add a guard that if txn->get_state()
== TransactionState::COMMITTED then do not proceed so a committed txn cannot be
aborted. Use the existing Transaction::get_state() and TransactionState enum
values (COMMITTED, ABORTED) to implement these checks and ensure no
state/history mutation occurs when a terminal-state conflict is detected.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/server_tests.cpp (1)
1-124: ⚠️ Potential issue | 🟡 Minor

`clang-format` CI failure is still open for this file. GitHub Actions reports formatting violations here. Please run `clang-format -i tests/server_tests.cpp` and commit the result so the formatting gate passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 1 - 124, Run clang-format -i on the test file that contains the ServerTests (tests with TEST(ServerTests, StatusStrings), TEST(ServerTests, Lifecycle) and TEST(ServerTests, Handshake)) to fix the formatting violations reported by CI, then stage and commit the reformatted file so the formatting gate passes; no code changes besides formatting should be made.
♻️ Duplicate comments (8)
include/network/rpc_message.hpp (1)
120-135: ⚠️ Potential issue | 🔴 Critical

Fix unsafe and non-portable deserialization in `QueryResultsReply`. Line 127 copies 4 bytes without first ensuring `in.size() >= 5`, which is an OOB read risk. Also, `err_len`/`row_count` should be normalized with network byte order for cross-node portability.

Suggested fix

```diff
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
   QueryResultsReply reply;
-  if (in.empty()) return reply;
+  if (in.size() < 5) return reply; // success(1) + err_len(4)
   reply.success = in[0] != 0;
-  uint32_t err_len;
+  uint32_t err_len = 0;
   std::memcpy(&err_len, in.data() + 1, 4);
+  err_len = ntohl(err_len);
   if (in.size() >= 5 + err_len) {
     reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
   }
-  uint32_t row_count;
+  uint32_t row_count = 0;
   if (in.size() >= 9 + err_len) {
     std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+    row_count = ntohl(row_count);
     reply.rows.resize(row_count);
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 120 - 135, In QueryResultsReply::deserialize, avoid the OOB read and non-portable integer parsing by first verifying in.size() >= 5 before copying err_len and then verifying in.size() >= 5 + err_len + 4 before reading row_count; replace the raw memcpy interpretation of err_len and row_count with network-byte-order-aware conversion (e.g., read uint32_t from the byte stream then apply ntohl/ntohl-equivalent) so values are normalized across hosts; update the checks around reply.error_msg and reply.rows resizing to use these validated, converted lengths and bail out or throw if sizes would overflow or exceed in.size() bounds.

src/network/rpc_server.cpp (1)
93-102: ⚠️ Potential issue | 🔴 Critical

Use `recv_all` for header/payload to avoid frame corruption. Line 93 and Line 101 assume one `recv()` returns all requested bytes. TCP does not guarantee that, so header/payload can be truncated and misparsed.

Suggested fix pattern

```diff
+namespace {
+ssize_t recv_all(int fd, void* buf, size_t len) {
+  size_t off = 0;
+  auto* p = static_cast<char*>(buf);
+  while (off < len) {
+    const ssize_t n = recv(fd, p + off, len - off, 0);
+    if (n <= 0) return n;
+    off += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(off);
+}
+} // namespace
@@
-  const ssize_t n = recv(client_fd, header_buf, 8, 0);
-  if (n <= 0) {
+  const ssize_t n = recv_all(client_fd, header_buf, sizeof(header_buf));
+  if (n != static_cast<ssize_t>(sizeof(header_buf))) {
     break;
   }
@@
-  static_cast<void>(recv(client_fd, payload.data(), header.payload_len, 0));
+  if (recv_all(client_fd, payload.data(), header.payload_len) !=
+      static_cast<ssize_t>(header.payload_len)) {
+    break;
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_server.cpp` around lines 93 - 102, The recv calls reading the RPC frame are assuming full reads; replace them with a robust "recv_all" style loop to ensure exactly 8 bytes are read into header_buf and then exactly header.payload_len bytes into payload (or return/error if peer closes/short read). Update the code paths around header_buf, RpcHeader::decode(header_buf), payload and client_fd to call/use recv_all(client_fd, header_buf, 8) and recv_all(client_fd, payload.data(), header.payload_len) (or equivalent helper that returns bytes read or an error) and handle partial reads/EOF by breaking/closing the connection before attempting to decode or process a truncated frame.

include/distributed/shard_manager.hpp (1)
26-33: ⚠️ Potential issue | 🔴 Critical

Use deterministic hashing for shard selection. Line 31 uses `std::hash<std::string>`, which is not stable across implementations. That can route the same key to different shards on different nodes/builds.

Suggested fix (deterministic FNV-1a 64-bit)

```diff
-    // Simple hash for demo purposes
-    std::string s = pk_value.to_string();
-    size_t hash = std::hash<std::string>{}(s);
-    return static_cast<uint32_t>(hash % num_shards);
+    const std::string s = pk_value.to_string();
+    uint64_t hash = 14695981039346656037ULL; // FNV-1a offset basis
+    for (unsigned char c : s) {
+      hash ^= static_cast<uint64_t>(c);
+      hash *= 1099511628211ULL; // FNV prime
+    }
+    return static_cast<uint32_t>(hash % num_shards);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 26 - 33, compute_shard currently uses std::hash<std::string> which is non-deterministic across builds; replace it with a deterministic FNV-1a 64-bit hash implementation inside compute_shard (use the bytes of pk_value.to_string() or pk_value.data() if available), compute the 64-bit FNV-1a hash, then take hash % num_shards and return as uint32_t; ensure the function still handles num_shards==0 and casts the 64-bit result safely to uint32_t when returning.

include/network/rpc_server.hpp (1)
54-65: ⚠️ Potential issue | 🔴 Critical

Add explicit client-lifecycle shutdown state to prevent stop() hangs. The class still lacks client-fd tracking and shutdown coordination fields, so `stop()` cannot proactively unblock `handle_client()` threads waiting in `recv()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_server.hpp` around lines 54 - 65, Add explicit client-lifecycle shutdown state and coordination to allow stop() to unblock threads in recv(): introduce a thread-safe container for active client fds (e.g., std::unordered_set<int> client_fds_ with std::mutex client_fds_mutex_), an atomic<bool> shutting_down_{false} (or similar) and a std::condition_variable shutdown_cv_ to signal shutdown. Update accept_loop() to stop accepting when shutting_down_ is set; register/deregister fds in handle_client() using client_fds_ and periodically check shutting_down_ (or rely on shutdown_cv_) so the loop can exit; implement stop() to set shutting_down_, lock client_fds_mutex_, close all tracked fds to unblock recv(), notify shutdown_cv_, and join worker/accept threads. Ensure handlers_ synchronization remains unchanged.

src/network/rpc_client.cpp (2)
58-73: ⚠️ Potential issue | 🔴 Critical

Handle partial TCP transfers and keep response receive serialized per call. Single `send()`/`recv()` calls are not guaranteed to transfer full buffers, and `call()` reads the response outside the lock, so concurrent calls can interleave responses on the same socket.

Also applies to: 91-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 58 - 73, RpcClient::call currently assumes single send()/recv() complete transfers and reads the response outside any synchronization, which can cause partial reads/writes and interleaved responses across concurrent calls; fix by (1) implementing and using robust readn/writeN helpers (e.g., readFull/writeFull) that loop until the requested byte count is transferred or an error/EOF occurs, and replace the direct recv(fd_, ...) calls in call (and the similar block at the other occurrence) with readFull to fully read the 8-byte RpcHeader and then the full payload; and (2) serialize per-call socket use by protecting the entire send+receive sequence in RpcClient::call (including the send_only invocation or replace it with writeFull) with a mutex (e.g., a member std::mutex) so concurrent calls cannot interleave on fd_. Ensure errors from partial transfers propagate as failures from call().
79-81: ⚠️ Potential issue | 🔴 Critical

Fix self-deadlock in `send_only()` reconnect path. Line 80 calls `connect()` while `mutex_` is already held (Line 79), and `connect()` also locks `mutex_`. With `std::mutex`, this can deadlock.

Suggested fix
```diff
-bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
-  const std::scoped_lock<std::mutex> lock(mutex_);
-  if (fd_ < 0 && !connect()) {
-    return false;
-  }
+bool RpcClient::send_only(RpcType type, const std::vector<uint8_t>& payload) {
+  std::unique_lock<std::mutex> lock(mutex_);
+  if (fd_ < 0) {
+    lock.unlock();
+    if (!connect()) {
+      return false;
+    }
+    lock.lock();
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 79 - 81, send_only() currently holds std::scoped_lock<std::mutex> lock(mutex_) and then calls connect(), but connect() also locks mutex_, causing a self-deadlock; fix by checking fd_ under the scoped_lock, then release the lock before invoking connect() (e.g., drop the scoped_lock or limit its scope) and after connect() re-acquire the mutex to re-check fd_/state as needed; reference the send_only() function, the connect() method, the mutex_ member and fd_ variable when making this change.

src/distributed/distributed_executor.cpp (2)
60-73: ⚠️ Potential issue | 🔴 Critical

Async tasks capture loop variable `node` by reference. Each future may run after the loop advances, so `&node` is unsafe here.

✅ Capture `node` by value

```diff
-  prepare_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+  prepare_futures.push_back(std::async(std::launch::async, [node, payload]() {
@@
-  std::async(std::launch::async, [&node, payload, phase2_type]() {
+  std::async(std::launch::async, [node, payload, phase2_type]() {
@@
-  rollback_futures.push_back(std::async(std::launch::async, [&node, payload]() {
+  rollback_futures.push_back(std::async(std::launch::async, [node, payload]() {
```

```bash
#!/bin/bash
# Verify no async lambda still captures loop variable `node` by reference.
rg -nP 'std::async\([^)]*\[\s*&node' src/distributed/distributed_executor.cpp
# Expected: no matches
```

Also applies to: 92-99, 117-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/distributed/distributed_executor.cpp` around lines 60 - 73, Loop-captured variable `node` is captured by reference in the std::async lambdas (e.g., the lambda pushed into prepare_futures), which is unsafe; change the lambda captures to capture node by value (e.g., replace [&node, payload] with [node, payload]) for the prepare_futures async call and do the same for the other async lambdas referenced (the similar std::async calls around lines 92-99 and 117-125) so each future holds its own copy of node instead of referencing a loop variable.
32-37: ⚠️ Potential issue | 🔴 Critical

DDL path still returns success without applying any change.
This silently acknowledges DDL while doing nothing.
🚫 Safer interim behavior
```diff
     if (type == parser::StmtType::CreateTable || type == parser::StmtType::DropTable ||
         type == parser::StmtType::CreateIndex || type == parser::StmtType::DropIndex) {
-        // These are handled by Raft via the Catalog locally on the leader
-        // and replicated to followers.
-        return QueryResult(); // Default is success
+        QueryResult res;
+        res.set_error("Distributed DDL execution is not implemented");
+        return res;
     }
```
🟡 Minor comments (5)
src/network/rpc_client.cpp (1)

1-103: ⚠️ Potential issue | 🟡 Minor

Formatting check is failing for this file. CI reports `clang-format` failure; reformat this file to satisfy pipeline checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 1 - 103, The file fails clang-format; reformat the RpcClient implementation to match the project's formatting rules (e.g., run clang-format with the repo's config) so indentation, spacing, and brace placement are correct across the RpcClient class and its methods (RpcClient::connect, RpcClient::~RpcClient, RpcClient::disconnect, RpcClient::call, RpcClient::send_only), ensuring includes, namespace closing, and function signatures follow the configured style.

src/network/rpc_server.cpp (1)
1-119: ⚠️ Potential issue | 🟡 Minor

Formatting check is failing for this file. CI reports `clang-format` failure; please run formatting on this translation unit to unblock the pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_server.cpp` around lines 1 - 119, This file fails clang-format; reformat the translation unit (rpc_server.cpp) to match project style by running clang-format (or your project's format command) and committing the updated file so CI passes—apply formatting across the entire file including functions RpcServer::start, RpcServer::stop, RpcServer::set_handler, RpcServer::accept_loop and RpcServer::handle_client, then re-run CI to ensure the clang-format check succeeds.src/network/server.cpp-95-107 (1)
95-107: ⚠️ Potential issue | 🟡 Minor — Formatting check is failing for this file in CI.
Please run clang-format -i on this file to unblock the pipeline.
Also applies to: 122-122, 261-261, 268-268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/server.cpp` around lines 95 - 107, Run clang-format -i on src/network/server.cpp to fix the CI formatting failures; reformat the file so the constructor Server::Server(...) initializer list and the factory Server::create(...) follow the project's clang-format rules (ensure lines around Server::Server, Server::create, and the regions mentioned at lines ~122, ~261, ~268 are reformatted). Commit the reformatted file so CI formatting checks pass.
tests/distributed_txn_tests.cpp-36-39 (1)
36-39: ⚠️ Potential issue | 🟡 Minor — Missing null checks before dereferencing parsed statements.
Please assert parser success before *stmt in each test.
🧪 Proposed fix pattern
```diff
 Parser parser(std::move(lexer));
 auto stmt = parser.parse_statement();
+ASSERT_NE(stmt, nullptr);
 auto res = exec.execute(*stmt, "COMMIT");
```
Also applies to: 101-104, 209-212
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 36 - 39, The tests dereference the parsed statement pointer (*stmt) without asserting parsing succeeded; add a null-check assertion (e.g., ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr)) immediately after calling parser.parse_statement() before using *stmt in calls like exec.execute(*stmt, "COMMIT") (apply the same fix for the other occurrences around the blocks with parser.parse_statement() at the noted locations).
tests/distributed_tests.cpp-43-46 (1)
43-46: ⚠️ Potential issue | 🟡 Minor — Guard parser output before dereference.
Add ASSERT_NE(stmt, nullptr) before *stmt.
🧪 Proposed fix
```diff
 Parser parser(std::move(lexer));
 auto stmt = parser.parse_statement();
+ASSERT_NE(stmt, nullptr);
 auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_tests.cpp` around lines 43 - 46, The test dereferences the result of parser.parse_statement() without checking for null; add a guard asserting the pointer is non-null before dereference by inserting ASSERT_NE(stmt, nullptr) after auto stmt = parser.parse_statement(), so that the subsequent call exec.execute(*stmt, "CREATE TABLE test (id INT)") is safe and the test fails clearly if parsing returns null.
🧹 Nitpick comments (4)
include/distributed/distributed_executor.hpp (1)
1-5: Clean up duplicated doc-comment opener. Line 2 repeats `/**` inside the existing comment block; removing it improves header clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/distributed_executor.hpp` around lines 1 - 5, The file-level doc comment at the top of distributed_executor.hpp contains a duplicated opener ("/*" or "/**") on the second line; remove the extra "/**" so the header comment is a single well-formed block comment describing the file (leaving the existing lines with "@file distributed_executor.hpp" and "@brief High-level executor for distributed queries" intact).
include/distributed/shard_manager.hpp (1)
9-12: Make the header self-contained with direct std includes. This file uses std::optional and std::hash but does not include <optional> and <functional> directly. Add them explicitly to avoid transitive-include fragility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/distributed/shard_manager.hpp` around lines 9 - 12, The header is not self-contained: it uses std::optional and std::hash but only includes <string>, <vector>, <cstdint>; add direct includes for <optional> and <functional> at the top of include/distributed/shard_manager.hpp so symbols like std::optional and std::hash resolve without relying on transitive includes (ensure the new includes appear alongside the existing `#include` lines).
tests/distributed_txn_tests.cpp (1)
107-109: Avoid fixed sleeps for synchronization in deterministic unit tests. These waits are unnecessary if execute() already blocks until async phases complete.
Also applies to: 218-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/distributed_txn_tests.cpp` around lines 107 - 109, Remove the fixed std::this_thread::sleep_for(...) calls and rely on execute() to block until async phases complete; specifically delete the sleep before the EXPECT_EQ(commit_count.load(), 2) check (and the analogous sleep at the other occurrence) so the test asserts immediately after txn.execute() or, if execute() cannot be relied on, replace the sleep with a proper wait on commit_count using a small-timeout loop or condition_variable tied to commit_count rather than a fixed sleep. Ensure references to commit_count and execute() remain the synchronization points in the test.
tests/raft_simulation_tests.cpp (1)
33-38: FollowerToCandidate has no assertion after the election window. Right now it cannot fail on missing transition behavior. Please add an observable post-timeout expectation (or expose state for tests).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/raft_simulation_tests.cpp` around lines 33 - 38, The test FollowerToCandidate currently only sleeps and never asserts the node's post-election state; add a concrete expectation after the sleep that verifies the node transitioned (e.g., assert node->getState() == State::Candidate || node->getState() == State::Leader). If no accessor exists, expose the state for tests by adding a read-only accessor like RaftNode::getState() (or make the state member test-visible) and use that in the test assertion to fail when the transition did not occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/network/rpc_client.hpp`:
- Line 27: The is_connected() accessor reads fd_ without synchronization causing
a data race against connect() and disconnect() which hold mutex_; fix by
synchronizing access: either acquire the same mutex_ inside is_connected() (lock
mutex_ and return fd_ >= 0) or change fd_ to std::atomic<int> and update
connect()/disconnect() accordingly; reference is_connected(), connect(),
disconnect(), fd_, and mutex_ to locate the changes and ensure all reads/writes
use the chosen synchronization method.
In `@include/network/rpc_message.hpp`:
- Around line 148-157: The code serializes txn_id in TxnOperationArgs by
memcpy’ing host-endian uint64_t directly; update TxnOperationArgs::serialize to
convert txn_id to network byte order with htonll() before writing into the
output buffer and update TxnOperationArgs::deserialize to read the 8 bytes then
convert back with ntohll() into args.txn_id; apply the same pattern to other
numeric fields mentioned (e.g., QueryResultsReply::err_len and row_count) so all
multi-byte integer fields use htonll/ntohll (or appropriate hton*/ntoh* for
sizes) when serializing/deserializing.
In `@src/catalog/catalog.cpp`:
- Around line 75-82: When raft_node_ is set the code currently calls
create_table_local and returns local success without replicating via Raft;
change Catalog::create_table so that if raft_node_ != nullptr it does not
perform local-only DDL but instead returns/throws an explicit "not supported in
raft mode" error (or logs and aborts the operation) until the Raft
serialization/replication is implemented, rather than calling
create_table_local; apply the same change pattern to the other DDL paths that
currently bypass replication (e.g., any other methods that check raft_node_ and
call create_*_local or drop_*_local).
- Around line 95-100: The code currently hard-codes a default shard to localhost
by creating a ShardInfo with shard.shard_id = 0, shard.node_address =
"127.0.0.1", shard.port = 6432 and pushing it into table->shards; replace this
static assignment with a configurable or discovery-backed default: read the
shard address/port from configuration or environment (or call an existing helper
like getLocalAddress()/getClusterConfig() if available) and populate ShardInfo
from those values, falling back to a safe sentinel (e.g., empty node_address or
INADDR_ANY/0) rather than "127.0.0.1"; keep the same symbols (ShardInfo,
shard.shard_id, shard.node_address, shard.port, table->shards.push_back) so
changes are localized.
In `@src/distributed/distributed_executor.cpp`:
- Around line 47-55: The code incorrectly hardcodes GLOBAL_TXN_ID = 1 causing
cross-wiring of independent transactions; instead remove the constexpr and get
the real transaction id for this operation (e.g., from the execution context or
transaction manager) and assign it to network::TxnOperationArgs::txn_id before
serialize()—for example, obtain currentTxnId via your
TransactionManager/ExecutionContext API (or call a
TransactionIdGenerator/BeginTxn helper) and use that id where
parser::StmtType::TransactionCommit is handled so each commit/rollback uses the
correct unique txn id rather than GLOBAL_TXN_ID.
- Around line 86-107: Phase-2 RPC failures are currently ignored: when sending
TxnCommit via phase2_futures the result of network::RpcClient::call is discarded
so all_prepared can still return success even if some commits failed; update the
phase-2 loop to collect and enforce outcomes by having each async task return a
status (bool or error string) instead of void (e.g., change phase2_futures to
std::vector<std::future<ResultType>> and have the lambda capture
phase2_type/payload and return success/failure based on client.call), then after
joining examine all futures and if any commit failed set QueryResult to an error
(using res.set_error(...)) and return failure; ensure you reference phase2_type,
phase2_futures, network::RpcClient, client.call, all_prepared, and QueryResult
when making the change.
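One way to make phase-2 outcomes observable, sketched with a hypothetical run_phase2 helper; `CommitRpc` stands in for the real `network::RpcClient::call` wrapper and is not the project's actual API:

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical RPC stub: returns true on success. In the real code this
// would wrap the TxnCommit/TxnAbort call on network::RpcClient.
using CommitRpc = std::function<bool(const std::string& node)>;

// Phase 2 of 2PC: fan out commits and require every one to succeed.
// Each task returns a bool instead of void so failures are observable.
bool run_phase2(const std::vector<std::string>& nodes, const CommitRpc& rpc) {
    std::vector<std::future<bool>> futures;
    for (const auto& node : nodes) {
        // `node` captured by value; `rpc` outlives the futures because
        // we join all of them below before returning.
        futures.push_back(std::async(std::launch::async,
                                     [node, &rpc] { return rpc(node); }));
    }
    bool all_ok = true;
    for (auto& f : futures) all_ok = f.get() && all_ok;  // join everything
    return all_ok;
}
```

The caller can then translate a false return into `res.set_error(...)` instead of reporting success on a partially committed transaction.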
In `@src/distributed/raft_node.cpp`:
- Around line 47-53: The RpcServer handlers installed via
rpc_server_.set_handler for RpcType::RequestVote and RpcType::AppendEntries (and
the other handlers at 55-61) capture this and are not removed on RaftNode
shutdown; update RaftNode::stop() to unregister or replace those handlers with
non-capturing no-op handlers (or null/empty callbacks) so RpcServer will not
call into a destroyed RaftNode; specifically remove or overwrite handlers set by
rpc_server_.set_handler(...) for RequestVote and AppendEntries (and the other
listed RpcType handlers) in stop(), and ensure any replacement handlers do not
capture this and safely ignore incoming RPCs.
- Around line 196-225: The mutex_ is held across blocking send() calls which can
stall Raft; instead, build and persist all state changes while holding mutex_
(updates to persistent_state_, last_heartbeat_, reply.vote_granted, calls to
persist_state() and step_down(term)), then copy the reply into independent
buffers (e.g. std::vector<uint8_t> and header char[8]) while still holding the
lock or immediately after making a local copy, release mutex_, and perform
send(client_fd, ...) only after unlocking. Apply the same pattern to the other
handler (lines ~237-262): do state mutations under mutex_, prepare/send buffers
after unlocking to avoid holding mutex_ during network I/O.
- Around line 190-195: Before constructing candidate_id from payload, validate
payload length against the parsed id_len: after reading term and id_len
(variables term and id_len using payload), check that payload.size() >= 16 +
id_len and handle the error case (e.g., return an error, throw, or log and
ignore the packet) to avoid out-of-bounds reads; update the code around the
candidate_id construction (where candidate_id is created from payload.data() +
16) to perform this bounds check and use the safe path when the check fails.
- Around line 82-87: The predicate passed to cv_.wait_for re-calls
get_random_timeout(), re-randomizing the threshold; capture the previously
computed timeout and use it in the lambda instead. Update the cv_.wait_for call
so the lambda captures the local timeout (e.g., [this, timeout] or [timeout,
this]) and compares (std::chrono::system_clock::now() - last_heartbeat_ >
timeout) rather than calling get_random_timeout() again; keep checks for
!running_ and last_heartbeat_ unchanged.
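The fix can be sketched like this, assuming a simplified election loop; `get_random_timeout` and the parameter names mirror the description above but are illustrative, not the project's exact code:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <random>

// Randomized election timeout, rolled once per wait.
std::chrono::milliseconds get_random_timeout() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(150, 300);
    return std::chrono::milliseconds(dist(rng));
}

// Compute the timeout ONCE and reuse that same value inside the
// predicate, instead of re-rolling it on every spurious wakeup.
bool wait_for_heartbeat_timeout(std::condition_variable& cv, std::mutex& m,
                                std::chrono::steady_clock::time_point& last_heartbeat,
                                bool& running) {
    const auto timeout = get_random_timeout();  // fixed for this wait
    std::unique_lock<std::mutex> lock(m);
    cv.wait_for(lock, timeout, [&, timeout] {
        return !running ||
               std::chrono::steady_clock::now() - last_heartbeat > timeout;
    });
    return std::chrono::steady_clock::now() - last_heartbeat > timeout;
}
```

With a re-rolled threshold, a wakeup could compare against a different (possibly larger) timeout than the one the wait was armed with, making elections erratic.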
In `@src/main.cpp`:
- Around line 149-157: The command-line parser currently maps any unknown --mode
value to cloudsql::config::RunMode::Standalone which hides typos; change the
else branch in the parsing block that reads mode_str to validate allowed values
("coordinator" or "distributed" -> cloudsql::config::RunMode::Coordinator,
"data" -> cloudsql::config::RunMode::Data, and optionally accept "standalone" ->
cloudsql::config::RunMode::Standalone) and if mode_str is none of those, print a
clear error mentioning permitted values (use the same variable names like
mode_str and config.mode/RunMode) and terminate with a non-zero exit (e.g.,
return 1 or std::exit(1)) so invalid modes are rejected instead of silently
defaulting.
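A minimal sketch of strict mode parsing; `parse_mode` is a hypothetical helper extracted for illustration, while the real code inlines this in main's argument loop:

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class RunMode { Standalone, Coordinator, Data };

// Map --mode values explicitly; anything unrecognized yields nullopt so
// the caller can print an error and exit instead of silently defaulting.
std::optional<RunMode> parse_mode(const std::string& mode_str) {
    if (mode_str == "standalone") return RunMode::Standalone;
    if (mode_str == "coordinator" || mode_str == "distributed")
        return RunMode::Coordinator;
    if (mode_str == "data") return RunMode::Data;
    return std::nullopt;  // a typo such as "cordinator" is rejected
}
```

At the call site, a nullopt result prints the permitted values and returns a non-zero exit code.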
- Around line 270-351: The 2PC RPC handlers currently assume success even when
the transaction doesn't exist; update the TxnPrepare, TxnCommit, and TxnAbort
handlers to validate the transaction returned by
transaction_manager.get_transaction(args.txn_id) and return failure if it's null
(set reply.success = false and a descriptive reply.error_msg). In the TxnPrepare
handler (RpcType::TxnPrepare) also perform the real prepare operation on the
found txn (e.g., call transaction_manager.prepare(txn) or the equivalent prepare
method) before setting reply.success = true, and catch/propagate errors from
that prepare call; for TxnCommit and TxnAbort only call
transaction_manager.commit(txn) / transaction_manager.abort(txn) when txn is
non-null and otherwise set failure with an error message.
- Around line 259-267: The RPC reply writes are unsafe: fix the code paths that
build and send responses (where reply.serialize(), cloudsql::network::RpcHeader
resp_h.encode(...), and send(fd, ...) are used) by (1) validating resp_p.size()
fits into uint16_t before assigning resp_h.payload_len and return/log an error
if too large to avoid silent truncation, (2) performing looped writes that check
send(...) return values and retry until the full header and payload are written
(handle partial writes), and (3) protect against SIGPIPE by using MSG_NOSIGNAL
on send or installing a SIGPIPE handler and treating send errors (including
EPIPE/EINTR) by aborting the send, closing the socket, and logging the errno;
apply these changes for every place where resp_h/payload and send are used.
In `@src/network/rpc_client.cpp`:
- Line 86: The assignment header.payload_len =
static_cast<uint16_t>(payload.size()) can silently truncate payloads > 65535;
before narrowing, check payload.size() <= std::numeric_limits<uint16_t>::max()
(or UINT16_MAX) and reject/handle oversized payloads (e.g., log error and
return/fail the RPC send) instead of casting. Update the code around the
header.payload_len assignment in rpc_client.cpp to perform this validation and
use the existing error/reporting mechanism (or throw) if the payload is too
large so you never store a truncated value in header.payload_len.
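The guard can be sketched as a small helper; `encode_payload_len` is a hypothetical name, as the real code performs this check inline before filling the header:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Validate before narrowing: a payload larger than UINT16_MAX must be
// rejected, never truncated into the 16-bit length field.
bool encode_payload_len(const std::vector<uint8_t>& payload, uint16_t& out_len) {
    if (payload.size() > std::numeric_limits<uint16_t>::max()) {
        return false;  // caller fails the RPC instead of sending garbage
    }
    out_len = static_cast<uint16_t>(payload.size());
    return true;
}
```

A truncated length field is worse than an outright failure: the peer would decode a frame boundary in the middle of the payload and misinterpret everything after it.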
In `@src/network/rpc_server.cpp`:
- Around line 52-67: RpcServer::stop currently sets running_ = false and joins
accept_thread_ and each thread in worker_threads_ but worker threads in
handle_client remain blocked in recv(); to fix, ensure each client socket is
shutdown/closed before joining threads: track/iterate over client FDs (or store
connection wrappers associated with worker_threads_), call shutdown(fd,
SHUT_RDWR) and close(fd) (or socket::close equivalent) to unblock recv in
handle_client, then join accept_thread_ and worker_threads_, and finally clear
worker_threads_ and reset listen_fd_; update RpcServer::stop and ensure
handle_client checks running_ after recv returns to exit cleanly.
In `@tests/distributed_tests.cpp`:
- Around line 45-47: The test calls exec.execute(*stmt, "CREATE TABLE test (id
INT)") and only checks res.success(), which allows a silent no-op; update the
DDLRouting test to actually verify the side effect by asserting the table exists
after execution (e.g., invoke the appropriate catalog/table-existence check or
run a simple metadata/query like exec.execute("SELECT ... from
sqlite_master/...") and EXPECT_TRUE on that result). Locate the CREATE TABLE
invocation around exec.execute, stmt and res, then add a follow-up assertion
that confirms the "test" table is present (use the existing catalog API or exec
helper used elsewhere in tests).
In `@tests/server_tests.cpp`:
- Around line 102-118: The test currently hides connection failures and assumes
recv returns full frames; change the connect check to a hard test (use
ASSERT_EQ/ASSERT_TRUE on connect(...) == 0) so the test fails immediately if the
socket can't connect, and replace the direct recv calls with a read-exact loop
(or helper like read_exact) that repeatedly calls recv until the expected byte
count is received or an error/EOF occurs; apply this for both the 9-byte Auth
response and the 6-byte ReadyForQuery reads, and validate return values and
contents (check buffer[0] == 'R' and buffer[0] == 'Z') after the full frame is
read. Ensure you reference the existing variables/functions: connect, sock,
addr, startup/STARTUP_PKT_LEN, recv, buffer.
---
Outside diff comments:
In `@tests/server_tests.cpp`:
- Around line 1-124: Run clang-format -i on the test file that contains the
ServerTests (tests with TEST(ServerTests, StatusStrings), TEST(ServerTests,
Lifecycle) and TEST(ServerTests, Handshake)) to fix the formatting violations
reported by CI, then stage and commit the reformatted file so the formatting
gate passes; no code changes besides formatting should be made.
---
Minor comments:
In `@src/network/rpc_client.cpp`:
- Around line 1-103: The file fails clang-format; reformat the RpcClient
implementation to match the project's formatting rules (e.g., run clang-format
with the repo's config) so indentation, spacing, and brace placement are correct
across the RpcClient class and its methods (RpcClient::connect,
RpcClient::~RpcClient, RpcClient::disconnect, RpcClient::call,
RpcClient::send_only), ensuring includes, namespace closing, and function
signatures follow the configured style.
In `@src/network/rpc_server.cpp`:
- Around line 1-119: This file fails clang-format; reformat the translation unit
(rpc_server.cpp) to match project style by running clang-format (or your
project's format command) and committing the updated file so CI passes—apply
formatting across the entire file including functions RpcServer::start,
RpcServer::stop, RpcServer::set_handler, RpcServer::accept_loop and
RpcServer::handle_client, then re-run CI to ensure the clang-format check
succeeds.
In `@src/network/server.cpp`:
- Around line 95-107: Run clang-format -i on src/network/server.cpp to fix the
CI formatting failures; reformat the file so the constructor Server::Server(...)
initializer list and the factory Server::create(...) follow the project's
clang-format rules (ensure lines around Server::Server, Server::create, and the
regions mentioned at lines ~122, ~261, ~268 are reformatted). Commit the
reformatted file so CI formatting checks pass.
In `@tests/distributed_tests.cpp`:
- Around line 43-46: The test dereferences the result of
parser.parse_statement() without checking for null; add a guard asserting the
pointer is non-null before dereference by inserting ASSERT_NE(stmt, nullptr)
after auto stmt = parser.parse_statement(), so that the subsequent call
exec.execute(*stmt, "CREATE TABLE test (id INT)") is safe and the test fails
clearly if parsing returns null.
In `@tests/distributed_txn_tests.cpp`:
- Around line 36-39: The tests dereference the parsed statement pointer (*stmt)
without asserting parsing succeeded; add a null-check assertion (e.g.,
ASSERT_TRUE(stmt) or ASSERT_NE(stmt, nullptr)) immediately after calling
parser.parse_statement() before using *stmt in calls like exec.execute(*stmt,
"COMMIT") (apply the same fix for the other occurrences around the blocks with
parser.parse_statement() at the noted locations).
---
Duplicate comments:
In `@include/distributed/shard_manager.hpp`:
- Around line 26-33: compute_shard currently uses std::hash<std::string> which
is non-deterministic across builds; replace it with a deterministic FNV-1a
64-bit hash implementation inside compute_shard (use the bytes of
pk_value.to_string() or pk_value.data() if available), compute the 64-bit FNV-1a
hash, then take hash % num_shards and return as uint32_t; ensure the function
still handles num_shards==0 and casts the 64-bit result safely to uint32_t when
returning.
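A sketch of the deterministic hash this prompt asks for; the constants are the standard 64-bit FNV-1a offset basis and prime:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Deterministic FNV-1a 64-bit hash: identical on every build and host,
// unlike std::hash<std::string>, so shard placement stays stable.
uint64_t fnv1a_64(const std::string& bytes) {
    uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        hash ^= c;
        hash *= 1099511628211ULL;             // FNV prime
    }
    return hash;
}

uint32_t compute_shard(const std::string& pk_bytes, uint32_t num_shards) {
    if (num_shards == 0) return 0;            // guard the modulo
    return static_cast<uint32_t>(fnv1a_64(pk_bytes) % num_shards);
}
```

Because the hash is pinned to fixed constants rather than an implementation-defined std::hash, rows route to the same shard regardless of compiler, standard library, or ASLR-style hash seeding.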
In `@include/network/rpc_message.hpp`:
- Around line 120-135: In QueryResultsReply::deserialize, avoid the OOB read and
non-portable integer parsing by first verifying in.size() >= 5 before copying
err_len and then verifying in.size() >= 5 + err_len + 4 before reading
row_count; replace the raw memcpy interpretation of err_len and row_count with
network-byte-order-aware conversion (e.g., read uint32_t from the byte stream
then apply ntohl/ntohl-equivalent) so values are normalized across hosts; update
the checks around reply.error_msg and reply.rows resizing to use these
validated, converted lengths and bail out or throw if sizes would overflow or
exceed in.size() bounds.
In `@include/network/rpc_server.hpp`:
- Around line 54-65: Add explicit client-lifecycle shutdown state and
coordination to allow stop() to unblock threads in recv(): introduce a
thread-safe container for active client fds (e.g., std::unordered_set<int>
client_fds_ with std::mutex client_fds_mutex_), an atomic<bool>
shutting_down_{false} (or similar) and a std::condition_variable shutdown_cv_ to
signal shutdown. Update accept_loop() to stop accepting when shutting_down_ is
set; register/deregister fds in handle_client() using client_fds_ and
periodically check shutting_down_ (or rely on shutdown_cv_) so the loop can
exit; implement stop() to set shutting_down_, lock client_fds_mutex_, close all
tracked fds to unblock recv(), notify shutdown_cv_, and join worker/accept
threads. Ensure handlers_ synchronization remains unchanged.
In `@src/distributed/distributed_executor.cpp`:
- Around line 60-73: Loop-captured variable `node` is captured by reference in
the std::async lambdas (e.g., the lambda pushed into prepare_futures), which is
unsafe; change the lambda captures to capture node by value (e.g., replace
[&node, payload] with [node, payload]) for the prepare_futures async call and do
the same for the other async lambdas referenced (the similar std::async calls
around lines 92-99 and 117-125) so each future holds its own copy of node
instead of referencing a loop variable.
In `@src/network/rpc_client.cpp`:
- Around line 58-73: RpcClient::call currently assumes single send()/recv()
complete transfers and reads the response outside any synchronization, which can
cause partial reads/writes and interleaved responses across concurrent calls;
fix by (1) implementing and using robust readn/writeN helpers (e.g.,
readFull/writeFull) that loop until the requested byte count is transferred or
an error/EOF occurs, and replace the direct recv(fd_, ...) calls in call (and
the similar block at the other occurrence) with readFull to fully read the
8-byte RpcHeader and then the full payload; and (2) serialize per-call socket
use by protecting the entire send+receive sequence in RpcClient::call (including
the send_only invocation or replace it with writeFull) with a mutex (e.g., a
member std::mutex) so concurrent calls cannot interleave on fd_. Ensure errors
from partial transfers propagate as failures from call().
- Around line 79-81: send_only() currently holds std::scoped_lock<std::mutex>
lock(mutex_) and then calls connect(), but connect() also locks mutex_, causing
a self-deadlock; fix by checking fd_ under the scoped_lock, then release the
lock before invoking connect() (e.g., drop the scoped_lock or limit its scope)
and after connect() re-acquire the mutex to re-check fd_/state as needed;
reference the send_only() function, the connect() method, the mutex_ member and
fd_ variable when making this change.
In `@src/network/rpc_server.cpp`:
- Around line 93-102: The recv calls reading the RPC frame are assuming full
reads; replace them with a robust "recv_all" style loop to ensure exactly 8
bytes are read into header_buf and then exactly header.payload_len bytes into
payload (or return/error if peer closes/short read). Update the code paths
around header_buf, RpcHeader::decode(header_buf), payload and client_fd to
call/use recv_all(client_fd, header_buf, 8) and recv_all(client_fd,
payload.data(), header.payload_len) (or equivalent helper that returns bytes
read or an error) and handle partial reads/EOF by breaking/closing the
connection before attempting to decode or process a truncated frame.
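A recv_all helper along these lines makes the framing robust; this is a sketch against plain POSIX sockets, not the project's exact signatures:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

// Loop until exactly `len` bytes arrive; returns false on error or if
// the peer closes early, so a truncated frame is never decoded.
bool recv_all(int fd, void* buf, size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) return false;  // 0 = orderly shutdown, <0 = error
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}
```

The server would call this once for the 8-byte header and once for the payload, closing the connection on any false return instead of decoding a partial frame. A symmetric send_all loop covers partial writes on the reply path.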
---
Nitpick comments:
In `@include/distributed/distributed_executor.hpp`:
- Around line 1-5: The file-level doc comment at the top of
distributed_executor.hpp contains a duplicated opener ("/*" or "/**") on the
second line; remove the extra "/**" so the header comment is a single
well-formed block comment describing the file (leaving the existing lines with
"@file distributed_executor.hpp" and "@brief High-level executor for distributed
queries" intact).
In `@include/distributed/shard_manager.hpp`:
- Around line 9-12: The header is not self-contained: it uses std::optional and
std::hash but only includes <string>, <vector>, <cstdint>; add direct includes
for <optional> and <functional> at the top of
include/distributed/shard_manager.hpp so symbols like std::optional and
std::hash resolve without relying on transitive includes (ensure the new
includes appear alongside the existing `#include` lines).
In `@tests/distributed_txn_tests.cpp`:
- Around line 107-109: Remove the fixed std::this_thread::sleep_for(...) calls
and rely on execute() to block until async phases complete; specifically delete
the sleep before the EXPECT_EQ(commit_count.load(), 2) check (and the analogous
sleep at the other occurrence) so the test asserts immediately after
txn.execute() or, if execute() cannot be relied on, replace the sleep with a
proper wait on commit_count using a small-timeout loop or condition_variable
tied to commit_count rather than a fixed sleep. Ensure references to
commit_count and execute() remain the synchronization points in the test.
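One bounded-wait replacement for the fixed sleep; `wait_for_count` is a hypothetical test helper, not part of the project's test utilities:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Poll with a deadline: returns as soon as the counter reaches the
// target, and only fails after the full timeout has elapsed.
bool wait_for_count(const std::atomic<int>& counter, int target,
                    std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (counter.load() < target) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

Unlike a fixed sleep, this never slows the fast path (it returns immediately when the count is already met) and only adds latency in the failure case.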
In `@tests/raft_simulation_tests.cpp`:
- Around line 33-38: The test FollowerToCandidate currently only sleeps and
never asserts the node's post-election state; add a concrete expectation after
the sleep that verifies the node transitioned (e.g., assert node->getState() ==
State::Candidate || node->getState() == State::Leader). If no accessor exists,
expose the state for tests by adding a read-only accessor like
RaftNode::getState() (or make the state member test-visible) and use that in the
test assertion to fail when the transition did not occur.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- include/catalog/catalog.hpp
- include/common/config.hpp
- include/distributed/distributed_executor.hpp
- include/distributed/raft_node.hpp
- include/distributed/shard_manager.hpp
- include/network/rpc_client.hpp
- include/network/rpc_message.hpp
- include/network/rpc_server.hpp
- src/catalog/catalog.cpp
- src/distributed/distributed_executor.cpp
- src/distributed/raft_node.cpp
- src/main.cpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- src/storage/storage_manager.cpp
- tests/distributed_tests.cpp
- tests/distributed_txn_tests.cpp
- tests/raft_simulation_tests.cpp
- tests/server_tests.cpp
✅ Files skipped from review due to trivial changes (1)
- src/storage/storage_manager.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- include/catalog/catalog.hpp
```cpp
bool connect();
void disconnect();
bool is_connected() const { return fd_ >= 0; }
```
🧩 Analysis chain — scripts executed against the repository (poyrazK/cloudSQL):
```shell
# First, find the RPC client files
find . -type f -name "*rpc_client*" | head -20
# Look at the header file to understand the structure
cat -n include/network/rpc_client.hpp
# Find the implementation file
find . -type f -name "*.cpp" -o -name "*.cc" | xargs grep -l "rpc_client\|RpcClient" | head -10
# Now look at the implementation to see how fd_ is accessed
cat -n src/network/rpc_client.cpp
```
Guard is_connected() with the mutex to avoid a data race.
Line 27 reads fd_ without synchronization, while connect() (line 25) and disconnect() (line 50) write fd_ under mutex_. Make is_connected() lock mutex_ (or make fd_ atomic) to avoid undefined behavior under concurrent access.
Suggested fix
```diff
- bool is_connected() const { return fd_ >= 0; }
+ bool is_connected() const {
+     const std::scoped_lock<std::mutex> lock(mutex_);
+     return fd_ >= 0;
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_client.hpp` at line 27, The is_connected() accessor reads
fd_ without synchronization causing a data race against connect() and
disconnect() which hold mutex_; fix by synchronizing access: either acquire the
same mutex_ inside is_connected() (lock mutex_ and return fd_ >= 0) or change
fd_ to std::atomic<int> and update connect()/disconnect() accordingly; reference
is_connected(), connect(), disconnect(), fd_, and mutex_ to locate the changes
and ensure all reads/writes use the chosen synchronization method.
```cpp
[[nodiscard]] std::vector<uint8_t> serialize() const {
    std::vector<uint8_t> out(8);
    std::memcpy(out.data(), &txn_id, 8);
    return out;
}

static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) {
    TxnOperationArgs args;
    if (in.size() >= 8) {
        std::memcpy(&args.txn_id, in.data(), 8);
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the specific file and lines mentioned
cat -n include/network/rpc_message.hpp | head -170 | tail -30Repository: poyrazK/cloudSQL
Length of output: 1024
🏁 Script executed:
# Search for endianness-related patterns and functions in the codebase
rg -i "endian|hton|ntoh|byte.*order" --type-list | head -20Repository: poyrazK/cloudSQL
Length of output: 449
🏁 Script executed:
# Look for the definition of txn_id and TxnOperationArgs
rg "txn_id" include/network/rpc_message.hpp -B 5 -A 5Repository: poyrazK/cloudSQL
Length of output: 579
🏁 Script executed:
# Search for other serialization methods to see if they handle endianness
rg "serialize|deserialize" include/network/rpc_message.hpp -B 2 -A 10Repository: poyrazK/cloudSQL
Length of output: 2617
🏁 Script executed:
# Search for endianness handling utilities in the codebase
rg "hton|ntoh|endian|byte.*order" -iRepository: poyrazK/cloudSQL
Length of output: 2103
🏁 Script executed:
# Check how TxnOperationArgs is used (network context)
rg "TxnOperationArgs" -B 3 -A 3

Repository: poyrazK/cloudSQL
Length of output: 3518
🏁 Script executed:
# Look for any network or protocol specifications
fd -e md -e txt | head -20

Repository: poyrazK/cloudSQL
Length of output: 134
🏁 Script executed:
# Check if there are any comments about distributed/multi-arch scenarios
rg "distributed|architecture|endian|cluster" -i include/network/

Repository: poyrazK/cloudSQL
Length of output: 673
Normalize txn_id endianness in network payloads.
TxnOperationArgs currently copies host-endian uint64_t directly without byte order conversion. Use htonll() on serialize and ntohll() on deserialize to ensure transaction IDs remain consistent across mixed-architecture clusters. This same issue affects other numeric fields like QueryResultsReply.err_len and row_count.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 148 - 157, The code serializes
txn_id in TxnOperationArgs by memcpy’ing host-endian uint64_t directly; update
TxnOperationArgs::serialize to convert txn_id to network byte order with
htonll() before writing into the output buffer and update
TxnOperationArgs::deserialize to read the 8 bytes then convert back with
ntohll() into args.txn_id; apply the same pattern to other numeric fields
mentioned (e.g., QueryResultsReply::err_len and row_count) so all multi-byte
integer fields use htonll/ntohll (or appropriate hton*/ntoh* for sizes) when
serializing/deserializing.
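Since htonll()/ntohll() aren't standard POSIX, one portable way to get the 64-bit byte-order conversion the comment asks for is a shift-based helper (the names below are illustrative, not existing project functions):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative helpers: write/read a uint64_t in big-endian (network) byte
// order regardless of host endianness.
inline uint64_t htonll_portable(uint64_t v) {
    uint8_t bytes[8];
    for (int i = 0; i < 8; ++i) {
        bytes[i] = static_cast<uint8_t>(v >> (56 - 8 * i));  // MSB first
    }
    uint64_t wire = 0;
    std::memcpy(&wire, bytes, 8);
    return wire;
}

inline uint64_t ntohll_portable(uint64_t wire) {
    uint8_t bytes[8];
    std::memcpy(bytes, &wire, 8);
    uint64_t v = 0;
    for (uint8_t b : bytes) {
        v = (v << 8) | b;  // fold bytes back, MSB first
    }
    return v;
}
```

TxnOperationArgs::serialize() would then memcpy the result of htonll_portable(txn_id), and deserialize() would apply ntohll_portable after its memcpy; the same pattern covers the other multi-byte fields.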
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) {
    if (raft_node_ != nullptr) {
        /* TODO: Serialize DDL and replicate via Raft */
        /* For now, just call local to keep it working during Step 4 implementation */
        return create_table_local(table_name, std::move(columns));
    }
    return create_table_local(table_name, std::move(columns));
}
Raft-enabled path still performs local DDL without replication.
When raft_node_ is set, returning local success here can split catalog state across nodes.
🚧 Safer interim behavior
oid_t Catalog::create_table(const std::string& table_name, std::vector<ColumnInfo> columns) {
if (raft_node_ != nullptr) {
- /* TODO: Serialize DDL and replicate via Raft */
- /* For now, just call local to keep it working during Step 4 implementation */
- return create_table_local(table_name, std::move(columns));
+ throw std::runtime_error("Raft-replicated CREATE TABLE is not implemented yet");
}
return create_table_local(table_name, std::move(columns));
}

Also applies to: 111-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/catalog/catalog.cpp` around lines 75 - 82, When raft_node_ is set the
code currently calls create_table_local and returns local success without
replicating via Raft; change Catalog::create_table so that if raft_node_ !=
nullptr it does not perform local-only DDL but instead returns/throws an
explicit "not supported in raft mode" error (or logs and aborts the operation)
until the Raft serialization/replication is implemented, rather than calling
create_table_local; apply the same change pattern to the other DDL paths that
currently bypass replication (e.g., any other methods that check raft_node_ and
call create_*_local or drop_*_local).
/* Basic Shard Assignment */
ShardInfo shard;
shard.shard_id = 0;
shard.node_address = "127.0.0.1"; // Default
shard.port = 6432;
table->shards.push_back(shard);
Default shard assignment is hard-coded to localhost.
All tables are pinned to 127.0.0.1:6432, which is unsafe for real cluster routing and failover behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/catalog/catalog.cpp` around lines 95 - 100, The code currently hard-codes
a default shard to localhost by creating a ShardInfo with shard.shard_id = 0,
shard.node_address = "127.0.0.1", shard.port = 6432 and pushing it into
table->shards; replace this static assignment with a configurable or
discovery-backed default: read the shard address/port from configuration or
environment (or call an existing helper like
getLocalAddress()/getClusterConfig() if available) and populate ShardInfo from
those values, falling back to a safe sentinel (e.g., empty node_address or
INADDR_ANY/0) rather than "127.0.0.1"; keep the same symbols (ShardInfo,
shard.shard_id, shard.node_address, shard.port, table->shards.push_back) so
changes are localized.
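One way to make the default shard configurable, roughly per the prompt above (the CLOUDSQL_SHARD_ADDR/CLOUDSQL_SHARD_PORT variable names and the standalone ShardInfo here are assumptions for illustration, not the repo's actual config surface):

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Standalone stand-in for the catalog's ShardInfo.
struct ShardInfo {
    uint32_t shard_id = 0;
    std::string node_address;  // empty string = "unassigned" sentinel
    uint16_t port = 0;
};

// Read the default shard endpoint from the environment instead of pinning
// every table to 127.0.0.1:6432; fall back to the empty sentinel.
inline ShardInfo default_shard_from_env() {
    ShardInfo shard;
    if (const char* addr = std::getenv("CLOUDSQL_SHARD_ADDR")) {
        shard.node_address = addr;
    }
    if (const char* port = std::getenv("CLOUDSQL_SHARD_PORT")) {
        shard.port = static_cast<uint16_t>(std::strtoul(port, nullptr, 10));
    }
    return shard;
}
```

An empty node_address then fails fast at routing time instead of silently sending traffic to localhost.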
// For simplicity, we assume a single active global transaction ID.
constexpr uint64_t GLOBAL_TXN_ID = 1;

if (type == parser::StmtType::TransactionCommit) {
    std::string errors;

    network::TxnOperationArgs args;
    args.txn_id = GLOBAL_TXN_ID;
    auto payload = args.serialize();
Fixed GLOBAL_TXN_ID = 1 is unsafe for concurrent transactions.
All commits/rollbacks are multiplexed onto one ID, which can cross-wire independent transactions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/distributed/distributed_executor.cpp` around lines 47 - 55, The code
incorrectly hardcodes GLOBAL_TXN_ID = 1 causing cross-wiring of independent
transactions; instead remove the constexpr and get the real transaction id for
this operation (e.g., from the execution context or transaction manager) and
assign it to network::TxnOperationArgs::txn_id before serialize()—for example,
obtain currentTxnId via your TransactionManager/ExecutionContext API (or call a
TransactionIdGenerator/BeginTxn helper) and use that id where
parser::StmtType::TransactionCommit is handled so each commit/rollback uses the
correct unique txn id rather than GLOBAL_TXN_ID.
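Until the executor threads a real transaction context through, even a minimal per-commit id source avoids cross-wiring; a sketch (TxnIdGenerator is hypothetical, not an existing class in the repo):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical id source: each distributed commit/rollback gets its own id
// instead of the shared GLOBAL_TXN_ID = 1.
class TxnIdGenerator {
public:
    uint64_t next() {
        // fetch_add returns the pre-increment value, so ids are 1, 2, 3, ...
        return next_id_.fetch_add(1, std::memory_order_relaxed);
    }
private:
    std::atomic<uint64_t> next_id_{1};
};
```

The proper fix remains using the id the TransactionManager already assigned for the session, so prepare/commit/abort all carry the same value end to end.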
rpc_server->set_handler(
    cloudsql::network::RpcType::TxnPrepare,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        // In a full implementation, we'd find the txn by ID and flush its WAL.
        // For now, we just force a flush.
        log_manager->flush(true);
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnCommit,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
        if (txn) {
          transaction_manager.commit(txn);
        }
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnAbort,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
      cloudsql::network::QueryResultsReply reply;
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
        if (txn) {
          transaction_manager.abort(txn);
        }
        reply.success = true;
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }

      auto resp_p = reply.serialize();
      cloudsql::network::RpcHeader resp_h;
      resp_h.type = cloudsql::network::RpcType::QueryResults;
      resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
      char h_buf[8];
      resp_h.encode(h_buf);
      send(fd, h_buf, 8, 0);
      send(fd, resp_p.data(), resp_p.size(), 0);
    });
2PC handlers currently acknowledge success for unknown transactions.
Line 280 in TxnPrepare is a WAL flush-only success path and does not validate/prepare the transaction.
Lines 305-337 in TxnCommit/TxnAbort also return success when get_transaction(args.txn_id) is null. This breaks 2PC guarantees.
✅ Suggested fix
rpc_server->set_handler(
cloudsql::network::RpcType::TxnPrepare,
[&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
int fd) {
(void)h;
auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
cloudsql::network::QueryResultsReply reply;
try {
- // In a full implementation, we'd find the txn by ID and flush its WAL.
- // For now, we just force a flush.
- log_manager->flush(true);
- reply.success = true;
+ auto txn = transaction_manager.get_transaction(args.txn_id);
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.prepare(txn);
+ log_manager->flush(true);
+ reply.success = true;
+ }
} catch (const std::exception& e) {
reply.success = false;
reply.error_msg = e.what();
}
@@
try {
auto txn = transaction_manager.get_transaction(args.txn_id);
- if (txn) {
- transaction_manager.commit(txn);
- }
- reply.success = true;
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.commit(txn);
+ reply.success = true;
+ }
} catch (const std::exception& e) {
@@
try {
auto txn = transaction_manager.get_transaction(args.txn_id);
- if (txn) {
- transaction_manager.abort(txn);
- }
- reply.success = true;
+ if (!txn) {
+ reply.success = false;
+ reply.error_msg = "Unknown transaction id";
+ } else {
+ transaction_manager.abort(txn);
+ reply.success = true;
+ }
      } catch (const std::exception& e) {

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main.cpp` around lines 270 - 351, The 2PC RPC handlers currently assume
success even when the transaction doesn't exist; update the TxnPrepare,
TxnCommit, and TxnAbort handlers to validate the transaction returned by
transaction_manager.get_transaction(args.txn_id) and return failure if it's null
(set reply.success = false and a descriptive reply.error_msg). In the TxnPrepare
handler (RpcType::TxnPrepare) also perform the real prepare operation on the
found txn (e.g., call transaction_manager.prepare(txn) or the equivalent prepare
method) before setting reply.success = true, and catch/propagate errors from
that prepare call; for TxnCommit and TxnAbort only call
transaction_manager.commit(txn) / transaction_manager.abort(txn) when txn is
non-null and otherwise set failure with an error message.
RpcHeader header;
header.type = type;
header.payload_len = static_cast<uint16_t>(payload.size());
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and get an overview
fd "rpc_client.cpp" -t f

Repository: poyrazK/cloudSQL
Length of output: 86
🏁 Script executed:
# Read the relevant section of the file with context around line 86
wc -l src/network/rpc_client.cpp

Repository: poyrazK/cloudSQL
Length of output: 90
🏁 Script executed:
# Get the context around line 86
sed -n '70,110p' src/network/rpc_client.cpp

Repository: poyrazK/cloudSQL
Length of output: 885
🏁 Script executed:
# Get more context - find the function that contains line 86
sed -n '1,100p' src/network/rpc_client.cpp | head -100

Repository: poyrazK/cloudSQL
Length of output: 2397
🏁 Script executed:
# Find the header file with RpcHeader definition
fd "rpc" -t f | head -20

Repository: poyrazK/cloudSQL
Length of output: 207
🏁 Script executed:
# Search for RpcHeader definition
rg "struct RpcHeader|class RpcHeader" --type cpp

Repository: poyrazK/cloudSQL
Length of output: 110
🏁 Script executed:
# Look for the header file
fd "rpc_client.hpp" -t f

Repository: poyrazK/cloudSQL
Length of output: 90
🏁 Script executed:
# Read the header file to understand RpcHeader
cat src/network/rpc_client.hpp

Repository: poyrazK/cloudSQL
Length of output: 118
🏁 Script executed:
# Read the RpcHeader definition
cat include/network/rpc_message.hpp

Repository: poyrazK/cloudSQL
Length of output: 4954
🏁 Script executed:
# Read the client header
cat include/network/rpc_client.hpp

Repository: poyrazK/cloudSQL
Length of output: 1144
Validate payload length before narrowing to uint16_t.
The direct cast of payload.size() to uint16_t on line 86 silently truncates payloads exceeding 65535 bytes, corrupting RPC message framing. Add validation to reject oversized payloads:
Suggested guard
+ if (payload.size() > std::numeric_limits<uint16_t>::max()) {
+ return false;
+ }
  header.payload_len = static_cast<uint16_t>(payload.size());

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (payload.size() > std::numeric_limits<uint16_t>::max()) {
    return false;
}
header.payload_len = static_cast<uint16_t>(payload.size());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_client.cpp` at line 86, The assignment header.payload_len =
static_cast<uint16_t>(payload.size()) can silently truncate payloads > 65535;
before narrowing, check payload.size() <= std::numeric_limits<uint16_t>::max()
(or UINT16_MAX) and reject/handle oversized payloads (e.g., log error and
return/fail the RPC send) instead of casting. Update the code around the
header.payload_len assignment in rpc_client.cpp to perform this validation and
use the existing error/reporting mechanism (or throw) if the payload is too
large so you never store a truncated value in header.payload_len.
void RpcServer::stop() {
    running_ = false;
    if (listen_fd_ >= 0) {
        static_cast<void>(close(listen_fd_));
        listen_fd_ = -1;
    }
    if (accept_thread_.joinable()) {
        accept_thread_.join();
    }
    for (auto& t : worker_threads_) {
        if (t.joinable()) {
            t.join();
        }
    }
    worker_threads_.clear();
}
🧩 Analysis chain
🏁 Script executed:
# Locate and examine the RpcServer implementation
find . -name "rpc_server.cpp" -o -name "rpc_server.h" | head -20

Repository: poyrazK/cloudSQL
Length of output: 88
🏁 Script executed:
# Get an overview of the file structure
git ls-files | grep -E "(rpc_server|RpcServer)" | head -20

Repository: poyrazK/cloudSQL
Length of output: 117
🏁 Script executed:
# Check file sizes
wc -l src/network/rpc_server.cpp include/network/rpc_server.hpp

Repository: poyrazK/cloudSQL
Length of output: 138
🏁 Script executed:
# Read the full implementation file
cat -n src/network/rpc_server.cpp

Repository: poyrazK/cloudSQL
Length of output: 3898
🏁 Script executed:
# Read the header file
cat -n include/network/rpc_server.hpp

Repository: poyrazK/cloudSQL
Length of output: 2179
stop() blocks indefinitely on worker thread joins due to blocked recv() calls.
Worker threads in handle_client() may be blocked in recv() calls (lines 93, 101) waiting for client data. When stop() is called, setting running_=false doesn't unblock these threads—they will remain blocked until data arrives or the socket times out. This causes stop() to hang at lines 61-64 waiting for the threads to join. Close or shutdown client sockets before joining worker threads to ensure graceful termination.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/network/rpc_server.cpp` around lines 52 - 67, RpcServer::stop currently
sets running_ = false and joins accept_thread_ and each thread in
worker_threads_ but worker threads in handle_client remain blocked in recv(); to
fix, ensure each client socket is shutdown/closed before joining threads:
track/iterate over client FDs (or store connection wrappers associated with
worker_threads_), call shutdown(fd, SHUT_RDWR) and close(fd) (or socket::close
equivalent) to unblock recv in handle_client, then join accept_thread_ and
worker_threads_, and finally clear worker_threads_ and reset listen_fd_; update
RpcServer::stop and ensure handle_client checks running_ after recv returns to
exit cleanly.
auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
EXPECT_TRUE(res.success());
}
DDLRouting test does not verify the side effect.
This currently passes on a silent no-op. Please assert table existence after execution.
✅ Proposed assertion
auto res = exec.execute(*stmt, "CREATE TABLE test (id INT)");
EXPECT_TRUE(res.success());
+ auto table = catalog->get_table_by_name("test");
+ EXPECT_TRUE(table.has_value());

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/distributed_tests.cpp` around lines 45 - 47, The test calls
exec.execute(*stmt, "CREATE TABLE test (id INT)") and only checks res.success(),
which allows a silent no-op; update the DDLRouting test to actually verify the
side effect by asserting the table exists after execution (e.g., invoke the
appropriate catalog/table-existence check or run a simple metadata/query like
exec.execute("SELECT ... from sqlite_master/...") and EXPECT_TRUE on that
result). Locate the CREATE TABLE invocation around exec.execute, stmt and res,
then add a follow-up assertion that confirms the "test" table is present (use
the existing catalog API or exec helper used elsewhere in tests).
if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
    // Send startup packet
    const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
                                             htonl(196608)};
    send(sock, startup.data(), startup.size() * 4, 0);

    // Receive Auth OK
    std::array<char, 9> buffer{};
    ssize_t n = recv(sock, buffer.data(), 9, 0);
    EXPECT_EQ(n, 9);
    EXPECT_EQ(buffer[0], 'R');

    // Receive ReadyForQuery
    n = recv(sock, buffer.data(), 6, 0);
    EXPECT_EQ(n, 6);
    EXPECT_EQ(buffer[0], 'Z');
}
Handshake test currently allows false positives and flaky reads.
Line 102 gates all assertions behind if (connect(...) == 0), so the test can pass even when no connection is established. Also, Line 110 and Line 115 assume recv returns full frames in one call, which is not guaranteed for TCP streams.
✅ Suggested fix
- if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) == 0) {
- // Send startup packet
- const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
- htonl(196608)};
- send(sock, startup.data(), startup.size() * 4, 0);
-
- // Receive Auth OK
- std::array<char, 9> buffer{};
- ssize_t n = recv(sock, buffer.data(), 9, 0);
- EXPECT_EQ(n, 9);
- EXPECT_EQ(buffer[0], 'R');
-
- // Receive ReadyForQuery
- n = recv(sock, buffer.data(), 6, 0);
- EXPECT_EQ(n, 6);
- EXPECT_EQ(buffer[0], 'Z');
- }
+ ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0);
+
+ const std::array<uint32_t, 2> startup = {htonl(static_cast<uint32_t>(STARTUP_PKT_LEN)),
+ htonl(196608)};
+ ASSERT_EQ(send(sock, startup.data(), startup.size() * 4, 0),
+ static_cast<ssize_t>(startup.size() * 4));
+
+ auto recv_exact = [&](char* dst, size_t len) -> bool {
+ size_t got = 0;
+ while (got < len) {
+ ssize_t n = recv(sock, dst + got, len - got, 0);
+ if (n <= 0) return false;
+ got += static_cast<size_t>(n);
+ }
+ return true;
+ };
+
+ std::array<char, 9> buffer{};
+ ASSERT_TRUE(recv_exact(buffer.data(), 9));
+ EXPECT_EQ(buffer[0], 'R');
+
+ ASSERT_TRUE(recv_exact(buffer.data(), 6));
+ EXPECT_EQ(buffer[0], 'Z');

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/server_tests.cpp` around lines 102 - 118, The test currently hides
connection failures and assumes recv returns full frames; change the connect
check to a hard test (use ASSERT_EQ/ASSERT_TRUE on connect(...) == 0) so the
test fails immediately if the socket can't connect, and replace the direct recv
calls with a read-exact loop (or helper like read_exact) that repeatedly calls
recv until the expected byte count is received or an error/EOF occurs; apply
this for both the 9-byte Auth response and the 6-byte ReadyForQuery reads, and
validate return values and contents (check buffer[0] == 'R' and buffer[0] ==
'Z') after the full frame is read. Ensure you reference the existing
variables/functions: connect, sock, addr, startup/STARTUP_PKT_LEN, recv, buffer.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/server_tests.cpp (1)
102-118: ⚠️ Potential issue | 🟠 Major

Handshake test can silently pass without establishing connection.

The if (connect(...) == 0) pattern at Line 102 allows the test to pass even when no connection is made. Additionally, recv() calls at Lines 110 and 115 assume full protocol frames are returned in a single call, which TCP does not guarantee.

Replace if with ASSERT_EQ for the connect call and use a read-exact helper to ensure complete frame reception.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` around lines 102 - 118, The test currently uses if (connect(... ) == 0) which can silently skip assertions; replace that conditional with an assertion (e.g., ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)), 0)) so the test fails on connect error, and replace the direct recv(...) calls with a small helper (e.g., readExact(sock, void* buf, size_t len)) that loops until exactly the requested number of bytes are read or fails; use that helper for both the 9-byte Auth frame and the 6-byte ReadyForQuery frame (asserting the return equals the expected length) before checking buffer[0] for 'R' and 'Z', and keep use of STARTUP_PKT_LEN and buffer unchanged.

src/network/rpc_client.cpp (1)
96-102: ⚠️ Potential issue | 🟠 Major

Partial send() not handled.

The send() calls at lines 96 and 100 may return fewer bytes than requested on a non-blocking or interrupted socket. Both should loop until all bytes are sent.

🔧 Proposed fix with send_all helper
+namespace {
+ssize_t send_all(int fd, const void* buf, size_t len) {
+  size_t total = 0;
+  const auto* data = static_cast<const char*>(buf);
+  while (total < len) {
+    const ssize_t n = send(fd, data + total, len - total, 0);
+    if (n <= 0) return n;
+    total += static_cast<size_t>(n);
+  }
+  return static_cast<ssize_t>(total);
+}
+}  // namespace

-  if (send(fd_, header_buf, 8, 0) <= 0) {
+  if (send_all(fd_, header_buf, 8) != 8) {
     return false;
   }
   if (!payload.empty()) {
-    if (send(fd_, payload.data(), payload.size(), 0) <= 0) {
+    if (send_all(fd_, payload.data(), payload.size()) !=
+        static_cast<ssize_t>(payload.size())) {
       return false;
     }
   }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/network/rpc_client.cpp` around lines 96 - 102, The send() calls can return partial writes; implement a send_all helper (e.g., ssize_t send_all(int fd, const char* buf, size_t len, int flags)) that loops until all bytes are sent or a non-recoverable error occurs (retry on EINTR, handle EAGAIN/EWOULDBLOCK appropriately or propagate failure). Replace the direct calls that use fd_, header_buf, payload.data()/payload.size() with send_all to ensure the full 8-byte header and the full payload are transmitted, returning false on unrecoverable errors and true when all bytes are sent.
🧹 Nitpick comments (3)
tests/server_tests.cpp (3)
65-65: Consider checking socket creation for failure.
socket() returns -1 on failure. While unlikely in tests, checking this would prevent misleading failures.

🛡️ Proposed fix

- int sock = socket(AF_INET, SOCK_STREAM, 0);
+ int sock = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_NE(sock, -1) << "Failed to create socket";

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 65, The socket() call that assigns to variable 'sock' should be checked for failure (socket() returns -1) — after int sock = socket(AF_INET, SOCK_STREAM, 0); add a check for sock == -1 and handle the error (e.g., call perror or log the error and fail the test / return early) to avoid misleading test failures; ensure any allocated resources are cleaned up before exiting the test.
106-106: send() return value is discarded.

The return value of send() should be checked to ensure the full startup packet was transmitted. This helps diagnose failures when the handshake doesn't proceed as expected.

🛡️ Proposed fix

- send(sock, startup.data(), startup.size() * 4, 0);
+ ssize_t sent = send(sock, startup.data(), startup.size() * 4, 0);
+ ASSERT_EQ(sent, static_cast<ssize_t>(startup.size() * 4));

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 106, The call to send(sock, startup.data(), startup.size() * 4, 0) discards its return value so partial sends or errors go unnoticed; replace it with a send-loop that tracks bytes_sent versus total_len (total_len = startup.size() * 4), repeatedly calling send on the buffer pointer (startup.data() + bytes_sent) until bytes_sent == total_len, and on each iteration handle -1 return values by checking errno for EINTR (retry) or other errors (fail the test and log/ASSERT with errno), referencing the existing variables sock and startup and the send call to locate where to add the loop and error handling.
96-96: Check socket creation for failure.

Similar to the Lifecycle test, socket() can return -1 on failure.

🛡️ Proposed fix

  int sock = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_NE(sock, -1) << "Failed to create socket";

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/server_tests.cpp` at line 96, The socket creation call int sock = socket(AF_INET, SOCK_STREAM, 0); does not check for failure; update the test to verify the returned descriptor is valid (i.e., not -1) immediately after socket() (using the test framework's assert/expect) and handle the error path (fail the test or return) so subsequent calls don't use an invalid sock variable; reference the 'sock' variable and the socket(...) call in tests/server_tests.cpp when adding the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/network/rpc_message.hpp`:
- Around line 139-143: The CI failed due to formatting around the end of the RPC
message class; run clang-format -i on the header containing the RpcMessage class
and fix whitespace/brace placement around the method that returns reply (the
function with "return reply;") so the closing braces and semicolon are formatted
per project style, then commit the formatted file.
---
Duplicate comments:
In `@src/network/rpc_client.cpp`:
- Around line 96-102: The send() calls can return partial writes; implement a
send_all helper (e.g., ssize_t send_all(int fd, const char* buf, size_t len, int
flags)) that loops until all bytes are sent or a non-recoverable error occurs
(retry on EINTR, handle EAGAIN/EWOULDBLOCK appropriately or propagate failure).
Replace the direct calls that use fd_, header_buf, payload.data()/payload.size()
with send_all to ensure the full 8-byte header and the full payload are
transmitted, returning false on unrecoverable errors and true when all bytes are
sent.
In `@tests/server_tests.cpp`:
- Around line 102-118: The test currently uses if (connect(... ) == 0) which can
silently skip assertions; replace that conditional with an assertion (e.g.,
ASSERT_EQ(connect(sock, reinterpret_cast<struct sockaddr*>(&addr),
sizeof(addr)), 0)) so the test fails on connect error, and replace the direct
recv(...) calls with a small helper (e.g., readExact(sock, void* buf, size_t
len)) that loops until exactly the requested number of bytes are read or fails;
use that helper for both the 9-byte Auth frame and the 6-byte ReadyForQuery
frame (asserting the return equals the expected length) before checking
buffer[0] for 'R' and 'Z', and keep use of STARTUP_PKT_LEN and buffer unchanged.
---
Nitpick comments:
In `@tests/server_tests.cpp`:
- Line 65: The socket() call that assigns to variable 'sock' should be checked
for failure (socket() returns -1) — after int sock = socket(AF_INET,
SOCK_STREAM, 0); add a check for sock == -1 and handle the error (e.g., call
perror or log the error and fail the test / return early) to avoid misleading
test failures; ensure any allocated resources are cleaned up before exiting the
test.
- Line 106: The call to send(sock, startup.data(), startup.size() * 4, 0)
discards its return value so partial sends or errors go unnoticed; replace it
with a send-loop that tracks bytes_sent versus total_len (total_len =
startup.size() * 4), repeatedly calling send on the buffer pointer
(startup.data() + bytes_sent) until bytes_sent == total_len, and on each
iteration handle -1 return values by checking errno for EINTR (retry) or other
errors (fail the test and log/ASSERT with errno), referencing the existing
variables sock and startup and the send call to locate where to add the loop and
error handling.
- Line 96: The socket creation call int sock = socket(AF_INET, SOCK_STREAM, 0);
does not check for failure; update the test to verify the returned descriptor is
valid (i.e., not -1) immediately after socket() (using the test framework's
assert/expect) and handle the error path (fail the test or return) so subsequent
calls don't use an invalid sock variable; reference the 'sock' variable and the
socket(...) call in tests/server_tests.cpp when adding the check.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- include/distributed/distributed_executor.hpp
- include/network/rpc_message.hpp
- src/network/rpc_client.cpp
- src/network/rpc_server.cpp
- src/network/server.cpp
- tests/server_tests.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- include/distributed/distributed_executor.hpp
Actionable comments posted: 2
♻️ Duplicate comments (5)
src/main.cpp (3)
271-283: ⚠️ Potential issue | 🟠 Major

2PC handlers still acknowledge success for unknown transactions.

Lines 280-283: TxnPrepare does not validate/prepare by txn_id, and Lines 307-311 plus Lines 335-339 return success when get_transaction(...) is null. This breaks 2PC correctness.

✅ Suggested fix

rpc_server->set_handler(
    cloudsql::network::RpcType::TxnPrepare,
    [&](const cloudsql::network::RpcHeader& h, const std::vector<uint8_t>& p,
        int fd) {
      (void)h;
      auto args = cloudsql::network::TxnOperationArgs::deserialize(p);
-     (void)args;
      cloudsql::network::QueryResultsReply reply;
      try {
-       // In a full implementation, we'd find the txn by ID and flush its WAL.
-       // For now, we just force a flush.
-       log_manager->flush(true);
-       reply.success = true;
+       auto txn = transaction_manager.get_transaction(args.txn_id);
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.prepare(txn);
+         log_manager->flush(true);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {
        reply.success = false;
        reply.error_msg = e.what();
      }
@@
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
-       if (txn) {
-         transaction_manager.commit(txn);
-       }
-       reply.success = true;
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.commit(txn);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {
@@
      try {
        auto txn = transaction_manager.get_transaction(args.txn_id);
-       if (txn) {
-         transaction_manager.abort(txn);
-       }
-       reply.success = true;
+       if (!txn) {
+         reply.success = false;
+         reply.error_msg = "Unknown transaction id";
+       } else {
+         transaction_manager.abort(txn);
+         reply.success = true;
+       }
      } catch (const std::exception& e) {

Also applies to: 299-312, 327-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 271 - 283, The TxnPrepare RPC handler (RpcType::TxnPrepare) and the other 2PC handlers must not acknowledge success for unknown transactions: fetch and validate the transaction by txn_id (using get_transaction(...)) before flushing/applying any work, and if get_transaction(...) returns null/none return a failure reply (set reply.success = false and include a descriptive error/errno in the reply) instead of proceeding; update the handlers referenced (the TxnPrepare lambda and the other 2PC lambdas around where get_transaction(...) is used) to early-return on a null transaction and only perform log_manager->flush or other commit/prepare actions when the transaction exists.
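The guard the review asks for can be shown in isolation. The sketch below models the prepare-handler logic only — `TxnManager`, `Reply`, and `handle_prepare` are hypothetical stand-ins, not the project's API — to illustrate that a participant must vote NO for a `txn_id` it does not know, and only prepare (and, in the real system, flush the WAL for) transactions it can find:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical minimal model of the reviewed handler; names are illustrative.
struct Reply {
    bool success = false;
    std::string error_msg;
};

class TxnManager {
public:
    void begin(uint64_t id) { txns_[id] = false; }
    bool exists(uint64_t id) const { return txns_.count(id) != 0; }
    void prepare(uint64_t id) { txns_[id] = true; }
    bool prepared(uint64_t id) const {
        auto it = txns_.find(id);
        return it != txns_.end() && it->second;
    }
private:
    std::unordered_map<uint64_t, bool> txns_;  // txn id -> prepared flag
};

// A prepare vote must be NO for an unknown txn_id; only a known transaction
// is marked prepared (a stand-in for prepare + WAL flush) before voting YES.
Reply handle_prepare(TxnManager& tm, uint64_t txn_id) {
    Reply reply;
    if (!tm.exists(txn_id)) {
        reply.success = false;
        reply.error_msg = "Unknown transaction id";
        return reply;
    }
    tm.prepare(txn_id);
    reply.success = true;
    return reply;
}
```

The key property: a coordinator that receives this NO vote aborts the global transaction instead of committing on a participant that never did the work.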
150-158: ⚠️ Potential issue | 🟠 Major — Reject invalid `--mode` values instead of silently falling back to Standalone.
Line 156 currently maps any unknown mode string to `RunMode::Standalone`, which can hide operator typos and boot the wrong role.
✅ Suggested fix
 } else if ((arg == "-m" || arg == "--mode") && i + 1 < cmd_args.size()) {
     const std::string& mode_str = cmd_args[++i];
     if (mode_str == "coordinator" || mode_str == "distributed") {
         config.mode = cloudsql::config::RunMode::Coordinator;
     } else if (mode_str == "data") {
         config.mode = cloudsql::config::RunMode::Data;
+    } else if (mode_str == "standalone") {
+        config.mode = cloudsql::config::RunMode::Standalone;
     } else {
-        config.mode = cloudsql::config::RunMode::Standalone;
+        std::cerr << "Invalid mode: " << mode_str
+                  << " (expected: standalone|coordinator|data)\n";
+        return 1;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 150 - 158, The command-line parsing currently maps any unrecognized --mode value to cloudsql::config::RunMode::Standalone; change this to treat unknown values as an error: in the arg-handling block that reads mode_str (the branch that sets config.mode), validate that mode_str is exactly "coordinator"/"distributed" or "data" and if not, emit a clear error message (including the invalid mode), print usage/help, and exit with a non-zero status instead of assigning RunMode::Standalone; update any tests or callers expecting the old fallback behavior if necessary.
260-267: ⚠️ Potential issue | 🔴 Critical — Harden RPC writes: handle partial sends, SIGPIPE/EPIPE, and payload length overflow.
Lines 263/292/320/348 narrow the payload size to `uint16_t` without bounds checks, and lines 266/267 etc. ignore `send()` return values. This can truncate replies or corrupt frames under partial writes/disconnects.
✅ Suggested fix direction
+#include <cerrno>
+#include <limits>
@@
+auto send_all = [](int fd, const uint8_t* data, size_t len) -> bool {
+  size_t off = 0;
+  while (off < len) {
+    ssize_t n = send(fd, data + off, len - off, MSG_NOSIGNAL);
+    if (n < 0 && errno == EINTR) continue;
+    if (n <= 0) return false;
+    off += static_cast<size_t>(n);
+  }
+  return true;
+};
@@
-  resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
+  if (resp_p.size() > std::numeric_limits<uint16_t>::max()) {
+    return; // or send explicit protocol error
+  }
+  resp_h.payload_len = static_cast<uint16_t>(resp_p.size());
   char h_buf[8];
   resp_h.encode(h_buf);
-  send(fd, h_buf, 8, 0);
-  send(fd, resp_p.data(), resp_p.size(), 0);
+  if (!send_all(fd, reinterpret_cast<const uint8_t*>(h_buf), sizeof(h_buf))) return;
+  if (!send_all(fd, resp_p.data(), resp_p.size())) return;

#!/bin/bash
set -euo pipefail
echo "== payload_len narrow casts in src/main.cpp =="
rg -n -C2 'payload_len\s*=\s*static_cast<uint16_t>\s*\(' src/main.cpp
echo
echo "== raw send() callsites in src/main.cpp =="
rg -n -C2 '\bsend\s*\(' src/main.cpp
echo
echo "== SIGPIPE / no-signal protections in src/main.cpp =="
rg -n -C2 'MSG_NOSIGNAL|SO_NOSIGPIPE|SIGPIPE' src/main.cpp

Also applies to: 289-296, 317-324, 345-352
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.cpp` around lines 260 - 267, The response write currently narrows resp_p.size() into resp_h.payload_len (uint16_t) and performs raw send(fd, ...) calls without checking return values or handling partial writes or SIGPIPE/EPIPE; to fix, validate that resp_p.size() fits in uint16_t before the static_cast and return/handle an error if it overflows, then replace the two raw send() calls around resp_h.encode(...) and resp_p.data() with a robust write loop (e.g., a send_all helper used by the code) that checks send()'s return value, handles short writes by advancing the buffer and retrying until all bytes are sent, and detects EPIPE or other fatal errors to stop and close the fd; also suppress SIGPIPE for these sends (use MSG_NOSIGNAL on send or ensure SIGPIPE is ignored) and surface/log errors where appropriate (refer to resp_p, resp_h, payload_len, encode, and the send(fd, ...) callsites).
include/network/rpc_message.hpp (2)
150-160: ⚠️ Potential issue | 🟠 Major — Normalize `txn_id` to network byte order in `TxnOperationArgs`.
Lines 152 and 159 copy a host-endian `uint64_t` directly, so `txn_id` can be decoded incorrectly across mixed-endian nodes.
Proposed fix
 [[nodiscard]] std::vector<uint8_t> serialize() const {
     std::vector<uint8_t> out(8);
-    std::memcpy(out.data(), &txn_id, 8);
+    out[0] = static_cast<uint8_t>((txn_id >> 56) & 0xFF);
+    out[1] = static_cast<uint8_t>((txn_id >> 48) & 0xFF);
+    out[2] = static_cast<uint8_t>((txn_id >> 40) & 0xFF);
+    out[3] = static_cast<uint8_t>((txn_id >> 32) & 0xFF);
+    out[4] = static_cast<uint8_t>((txn_id >> 24) & 0xFF);
+    out[5] = static_cast<uint8_t>((txn_id >> 16) & 0xFF);
+    out[6] = static_cast<uint8_t>((txn_id >> 8) & 0xFF);
+    out[7] = static_cast<uint8_t>(txn_id & 0xFF);
     return out;
 }
 static TxnOperationArgs deserialize(const std::vector<uint8_t>& in) {
     TxnOperationArgs args;
     if (in.size() >= 8) {
-        std::memcpy(&args.txn_id, in.data(), 8);
+        args.txn_id = (static_cast<uint64_t>(in[0]) << 56) |
+                      (static_cast<uint64_t>(in[1]) << 48) |
+                      (static_cast<uint64_t>(in[2]) << 40) |
+                      (static_cast<uint64_t>(in[3]) << 32) |
+                      (static_cast<uint64_t>(in[4]) << 24) |
+                      (static_cast<uint64_t>(in[5]) << 16) |
+                      (static_cast<uint64_t>(in[6]) << 8) |
+                      static_cast<uint64_t>(in[7]);
     }
     return args;
 }

#!/bin/bash
# Verify all RPC multi-byte fields avoid raw host-endian memcpy in this header.
rg -nP 'std::memcpy\([^;]*&\s*(txn_id|err_len|row_count|magic|payload_len)\s*,\s*(4|8)\)' include/network/rpc_message.hpp -C2
rg -nP '\b(htonl|htons|ntohl|ntohs)\b' include/network/rpc_message.hpp -C2
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 150 - 160, The serialize/deserialize for TxnOperationArgs currently memcpy host-endian txn_id directly; change both TxnOperationArgs::serialize and TxnOperationArgs::deserialize to convert txn_id to a fixed network byte order before writing and convert back after reading (e.g., use a portable 64-bit host-to-network and network-to-host conversion like htobe64/be64toh or an inline byte-swap/shift helper) so the serialized bytes are consistent across endianness and then memcpy or assign those bytes; update only the logic around txn_id in serialize() and deserialize() to perform the conversion.
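The shift-based conversion the review suggests is endianness-independent by construction, since shifting operates on values rather than memory layout. A compact, self-contained version of that idea (helper names `pack_be64`/`unpack_be64` are illustrative, not the project's API):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Portable big-endian (network order) packing for a 64-bit id. Written with
// shifts on the value, so the byte layout is the same on any host endianness.
std::array<uint8_t, 8> pack_be64(uint64_t v) {
    std::array<uint8_t, 8> out{};
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<uint8_t>((v >> (56 - 8 * i)) & 0xFF);  // MSB first
    return out;
}

uint64_t unpack_be64(const std::array<uint8_t, 8>& in) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | in[i];  // fold bytes back in, most significant first
    return v;
}
```

On glibc/BSD systems `htobe64`/`be64toh` achieve the same thing, but the shift form avoids depending on a non-standard header.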
98-113: ⚠️ Potential issue | 🔴 Critical — Fix wire-format safety in `QueryResultsReply` (bounds + byte order).
Line 129 reads 4 bytes before verifying `in.size() >= 5`, which is an OOB risk on short payloads. Also, lines 105/112/129/136 currently serialize/deserialize multi-byte integers in host byte order, which breaks cross-architecture RPC compatibility.
Proposed fix
 [[nodiscard]] std::vector<uint8_t> serialize() const {
     std::vector<uint8_t> out;
     out.push_back(success ? 1 : 0);
     uint32_t err_len = static_cast<uint32_t>(error_msg.size());
+    uint32_t n_err_len = htonl(err_len);
     size_t offset = out.size();
     out.resize(offset + 4 + err_len);
-    std::memcpy(out.data() + offset, &err_len, 4);
+    std::memcpy(out.data() + offset, &n_err_len, 4);
     std::memcpy(out.data() + offset + 4, error_msg.data(), err_len);
     // Simplified row count serialization
     uint32_t row_count = static_cast<uint32_t>(rows.size());
+    uint32_t n_row_count = htonl(row_count);
     offset = out.size();
     out.resize(offset + 4);
-    std::memcpy(out.data() + offset, &row_count, 4);
+    std::memcpy(out.data() + offset, &n_row_count, 4);
     return out;
 }
 static QueryResultsReply deserialize(const std::vector<uint8_t>& in) {
     QueryResultsReply reply;
-    if (in.empty()) {
+    if (in.size() < 5) { // success(1) + err_len(4)
         return reply;
     }
     reply.success = in[0] != 0;
-    uint32_t err_len = 0;
-    std::memcpy(&err_len, in.data() + 1, 4);
-    if (in.size() >= 5 + err_len) {
-        reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
-    }
+    uint32_t err_len_net = 0;
+    std::memcpy(&err_len_net, in.data() + 1, 4);
+    const uint32_t err_len = ntohl(err_len_net);
+    if (in.size() < 5ULL + err_len) return reply;
+    reply.error_msg = std::string(reinterpret_cast<const char*>(in.data() + 5), err_len);
-    uint32_t row_count = 0;
-    if (in.size() >= 9 + err_len) {
-        std::memcpy(&row_count, in.data() + 5 + err_len, 4);
-        reply.rows.resize(row_count); // Placeholders
-    }
+    if (in.size() < 9ULL + err_len) return reply;
+    uint32_t row_count_net = 0;
+    std::memcpy(&row_count_net, in.data() + 5 + err_len, 4);
+    const uint32_t row_count = ntohl(row_count_net);
+    reply.rows.resize(row_count); // Placeholders
     return reply;
 }
Also applies to: 120-137
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/network/rpc_message.hpp` around lines 98 - 113, The serialize/deserialize for QueryResultsReply (serialize() and the corresponding deserialization code) is unsafe: it reads fixed-size fields without verifying buffer length and uses host byte order for multi-byte integers; fix by validating buffer size before every read (ensure at least 1 + 4 + err_len + 4 + ... bytes exist before memcpy/reads) and convert all multi-byte integers to/from network byte order (use htonl on serialize for uint32_t fields like err_len and row_count and ntohl on deserialize) so the wire format is consistent across architectures; apply the same bounds checks and byte-order conversions to the other affected functions mentioned (the deserialize counterpart and any code around lines handling err_len/row_count).
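The "validate length before every read" discipline generalizes: check the fixed prefix first, then check the variable-length part against the declared length before touching it. A self-contained sketch of that pattern for the success/err_len/error_msg prefix described above (`ParsedReply` and `parse_reply` are illustrative names; byte order is fixed big-endian via shifts):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative bounds-checked parse; not the project's actual deserializer.
struct ParsedReply {
    bool ok = false;       // whether the buffer parsed cleanly
    bool success = false;
    std::string error_msg;
};

ParsedReply parse_reply(const std::vector<uint8_t>& in) {
    ParsedReply r;
    if (in.size() < 5) return r;               // need success(1) + err_len(4)
    r.success = in[0] != 0;
    uint32_t err_len = (uint32_t(in[1]) << 24) | (uint32_t(in[2]) << 16) |
                       (uint32_t(in[3]) << 8)  |  uint32_t(in[4]);
    if (in.size() < 5ULL + err_len) return r;  // reject truncated payloads
    r.error_msg.assign(reinterpret_cast<const char*>(in.data() + 5), err_len);
    r.ok = true;
    return r;
}
```

Note the `5ULL + err_len` comparison: widening to 64 bits before adding prevents the length check itself from overflowing on a hostile 32-bit `err_len`.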
static RpcHeader decode(const char* in) {
    RpcHeader h;
    uint32_t n_magic = 0;
    uint16_t n_len = 0;
    std::memcpy(&n_magic, in, 4);
    h.magic = ntohl(n_magic);
    h.type = static_cast<RpcType>(static_cast<uint8_t>(in[4]));
    h.flags = static_cast<uint8_t>(in[5]);
    std::memcpy(&n_len, in + 6, 2);
    h.payload_len = ntohs(n_len);
    return h;
🧩 Analysis chain
🏁 Script executed:
# First, find the RpcHeader class and decode method
find . -type f -name "*.hpp" -o -name "*.h" -o -name "*.cpp" | head -20
🏁 Script executed:
# Search for RpcHeader::decode calls
rg -n 'RpcHeader::decode' --type cpp --type hpp
🏁 Script executed:
# Look for HEADER_SIZE definition
rg -n 'HEADER_SIZE' --type cpp --type hpp
🏁 Script executed:
# Find the actual decode function implementation
rg -n 'static.*RpcHeader.*decode' -A 15 --type cpp --type hpp
🏁 Script executed:
# Search for decode calls without type filter
rg -n 'RpcHeader::decode'
🏁 Script executed:
# Search for HEADER_SIZE
rg -n 'HEADER_SIZE'
🏁 Script executed:
# Read the rpc_message.hpp file to see the implementation
cat -n include/network/rpc_message.hpp
🏁 Script executed:
# Look for decode() function calls more broadly
rg -n '\.decode\(' -A 2 -B 2
🏁 Script executed:
# Examine the rpc_server.cpp call site with context
sed -n '95,115p' src/network/rpc_server.cpp
🏁 Script executed:
# Examine the rpc_client.cpp call site with context
sed -n '65,85p' src/network/rpc_client.cpp
🏁 Script executed:
# Look for header_buf declaration and initialization in rpc_server.cpp
sed -n '80,110p' src/network/rpc_server.cpp
🏁 Script executed:
# Look for header_buf declaration and initialization in rpc_client.cpp
sed -n '50,80p' src/network/rpc_client.cpp
Harden RpcHeader::decode API with input-length validation.
decode(const char* in) reads 8 bytes unconditionally. Both call sites in rpc_server.cpp (line 104) and rpc_client.cpp (line 74) use recv() to fill an 8-byte buffer but only check if recv() returned ≤ 0; they do not validate that exactly 8 bytes were received. If recv() returns fewer than 8 bytes (a valid partial read), decode() will read uninitialized memory from the buffer, causing an OOB read and undefined behavior.
Proposed fix
- static RpcHeader decode(const char* in) {
- RpcHeader h;
+ static bool decode(const char* in, size_t in_len, RpcHeader& h) {
+ if (in == nullptr || in_len < HEADER_SIZE) {
+ return false;
+ }
uint32_t n_magic = 0;
uint16_t n_len = 0;
std::memcpy(&n_magic, in, 4);
h.magic = ntohl(n_magic);
h.type = static_cast<RpcType>(static_cast<uint8_t>(in[4]));
h.flags = static_cast<uint8_t>(in[5]);
std::memcpy(&n_len, in + 6, 2);
h.payload_len = ntohs(n_len);
- return h;
+ return true;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 57 - 67, The RpcHeader::decode
function currently assumes an 8-byte buffer; change its API to accept a buffer
length (e.g., decode(const char* in, size_t len)) and validate len >= 8 before
reading, returning an error (throw, optional, or sentinel RpcHeader) on short
input; update all call sites (rpc_server.cpp and rpc_client.cpp) that call
RpcHeader::decode to pass the actual bytes received and handle the error path
when fewer than 8 bytes were returned by recv() so you never memcpy/read
uninitialized/partial data.
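The suggested length-checked API can be exercised in isolation. The sketch below mirrors the 8-byte wire layout from the diff (magic:4, type:1, flags:1, payload_len:2, big-endian) but uses stand-alone names — `Header`, `decode_header`, `kHeaderSize` are illustrative, not the project's types:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative length-validated header decode; rejects short/partial reads
// instead of reading past the buffer.
struct Header {
    uint32_t magic = 0;
    uint8_t type = 0;
    uint8_t flags = 0;
    uint16_t payload_len = 0;
};

constexpr size_t kHeaderSize = 8;

bool decode_header(const uint8_t* in, size_t len, Header& h) {
    if (in == nullptr || len < kHeaderSize) return false;  // partial recv()
    h.magic = (uint32_t(in[0]) << 24) | (uint32_t(in[1]) << 16) |
              (uint32_t(in[2]) << 8)  |  uint32_t(in[3]);
    h.type = in[4];
    h.flags = in[5];
    h.payload_len = static_cast<uint16_t>((uint16_t(in[6]) << 8) | in[7]);
    return true;
}
```

Callers that loop on `recv()` should accumulate until `kHeaderSize` bytes have arrived and only then call `decode_header`; the `false` return makes the short-read path impossible to ignore silently.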
uint32_t row_count = 0;
if (in.size() >= 9 + err_len) {
    std::memcpy(&row_count, in.data() + 5 + err_len, 4);
    reply.rows.resize(row_count); // Placeholders
}
Bound row_count before resizing to prevent memory exhaustion.
Line 137 trusts untrusted wire data and does rows.resize(row_count) directly. A malicious payload can force excessive allocation and crash the process.
Proposed fix
uint32_t row_count = 0;
if (in.size() >= 9 + err_len) {
std::memcpy(&row_count, in.data() + 5 + err_len, 4);
+ constexpr uint32_t kMaxPlaceholderRows = 100000;
+ if (row_count > kMaxPlaceholderRows) {
+ reply.success = false;
+ reply.error_msg = "row_count exceeds protocol limit";
+ return reply;
+ }
reply.rows.resize(row_count); // Placeholders
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/network/rpc_message.hpp` around lines 134 - 138, The code unsafely
calls reply.rows.resize(row_count) with untrusted row_count from the wire;
before resizing, validate row_count against a safe maximum and ensure the buffer
contains enough bytes for that many rows to avoid integer overflow and excessive
allocation. Concretely, in the block that reads row_count (using variables
row_count, err_len, in, and reply.rows.resize), add checks: cap row_count to a
defined MAX_ROWS constant, verify that multiplying row_count by the per-row
serialized size cannot overflow and that in.size() is large enough to contain
the expected row data, and only then call reply.rows.resize(row_count); if any
check fails, treat the message as malformed and return an error.
This PR implements the Phase 4 distributed operations:
Summary by CodeRabbit
New Features
Tests