Skip to content

Uninitialized memory read in depth_init_MathAPI from pair_pos buffer off-by-one #3

@TarekIbnZiad

Description

@TarekIbnZiad

Uninitialized memory read in depth_init_MathAPI from pair_pos buffer off-by-one

Environment

  • cuJSON commit: 2ac7d3dcd7ad1ff64ebdb14022bf94c59b3b4953 (branch master)
  • OS: Ubuntu 22.04.5 LTS
  • GPU: NVIDIA A40 (Ampere, sm_86)
  • NVIDIA driver: 590.48.01
  • CUDA toolkit: 13.1 (nvcc V13.1.115)

Component

cuJSON — Standard JSON parser, parse_standard_json.cu (parse_standard_json()Parser()depth_init_MathAPI(), consumed at parse_standard_json.cu:1393)

Severity

High (CWE-908, Use of Uninitialized Resource) — silent data corruption on every parse, with potential out-of-bounds access.

Description

In parse_standard_json, the result buffer res_buff is partitioned into structural and pair_pos arrays. The pair_pos pointer is set to res_buff + result_size + 1, but the cudaMemcpy that populates pair-position data writes starting at res_buff + result_size + 2. This off-by-one gap leaves pair_pos[0] (i.e. res_buff[result_size + 1]) as uninitialized pinned host memory. The uninitialized value propagates into the depth_init_MathAPI kernel (line 1393) and into the downstream CUB radix sort, corrupting the JSON parse tree for every input.

The bug is deterministic and input-independent: any valid JSON that produces at least one structural element triggers it (confirmed with {"key":123} and [1,2,3]).

Root cause

The function allocates one pinned buffer and partitions it into two contiguous arrays:

res_buff:
  [0 .. result_size-1]              -> structural (result_size int32_t)
  [result_size+1 .. 2*result_size]  -> pair_pos   (result_size int32_t)
parsed_tree.structural = res_buff;                  // index 0
parsed_tree.pair_pos   = res_buff + result_size + 1; // index result_size+1

The second cudaMemcpy writes to res_buff + result_size + 2, not + 1, leaving a 1-element gap:

cudaMallocHost(&res_buff, (2 * result_size + 2) * sizeof(int32_t));

cudaMemcpy(res_buff, dev_structural, result_size * sizeof(int32_t), cudaMemcpyDeviceToHost);

// note +2, not +1
cudaMemcpy(res_buff + result_size + 2, dev_pair_pos, result_size * sizeof(int32_t), cudaMemcpyDeviceToHost);

parsed_tree.pair_pos = res_buff + result_size + 1;  // index result_size+1 is NEVER written

So pair_pos[0] reads uninitialized pinned host memory from cudaMallocHost. depth_init_MathAPI then uses the pair positions to compute tree depth, operating on garbage; the corrupted pair data also flows into CUB's radix sort (block_load.cuh:130), confirming propagation through multiple kernels.

Impact

Category Assessment
Data Corruption High — uninitialized pair_pos[0] feeds depth_init_MathAPI and CUB radix sort, silently corrupting parse-tree depth/pair-matching for every document.
Information Leak Medium — uninitialized pinned host memory may hold residual data; if pair_pos values are exposed via API/serialized output, stale contents could leak.
Denial of Service Medium — a garbage pair_pos[0] used as an index in downstream kernels could cause an out-of-bounds access, GPU fault, or hang.

Reproduction

CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
nvcc -std=c++17 -arch=native -O2 -g -lineinfo \
  -I<cujson-src> -I${CUDA_HOME}/include \
  llm_harness_parse_standard_json.cu harness_afl_main.cpp \
  --compiler-bindir g++-13 -L${CUDA_HOME}/lib64 -lcudart \
  -o /tmp/cujson_parse_standard_json_initcheck.bin

compute-sanitizer --tool=initcheck --report-api-errors=no --error-exitcode 99 \
  /tmp/cujson_parse_standard_json_initcheck.bin trigger_input.json

Input is any valid JSON with structural tokens (e.g. {"key":123}).

Sanitizer evidence

========= Uninitialized __global__ memory read of size 4 bytes
=========   at depth_init_MathAPI(unsigned int *, unsigned int *, int, int)+0xa0 in parse_standard_json.cu:1393
=========   by thread (0,0,0) in block (0,0,0)
=========   Host Frame: Parser(...) 
=========   Host Frame: parse_standard_json(cuJSONInput)
=========   Host Frame: main
=========
========= Uninitialized __global__ memory read of size 1 bytes
=========   at cub::...::LoadDirectBlocked<...>+0x380 in block_load.cuh:130
=========   by thread (0,0,0) in block (0,0,0)
=========   Device Frame: cub::...::DeviceRadixSortSingleTileKernel<...>+0xb0 in radix_sort.cuh:401

Suggested fix

The root cause is the off-by-one mismatch between the pair_pos pointer (res_buff + result_size + 1) and the second cudaMemcpy destination (res_buff + result_size + 2). Any of the following resolves it.

Option A — align the pointer to the copy destination:

- parsed_tree.pair_pos = res_buff + result_size + 1;
+ parsed_tree.pair_pos = res_buff + result_size + 2;

Option B — align the copy destination to the pointer:

- cudaMemcpy(res_buff + result_size + 2, dev_pair_pos, result_size * sizeof(int32_t), ...);
+ cudaMemcpy(res_buff + result_size + 1, dev_pair_pos, result_size * sizeof(int32_t), ...);

Option C (preferred long term) — remove the unused gap and tighten the allocation:

- cudaMallocHost(&res_buff, (2 * result_size + 2) * sizeof(int32_t));
+ cudaMallocHost(&res_buff, (2 * result_size) * sizeof(int32_t));
- parsed_tree.pair_pos = res_buff + result_size + 1;
+ parsed_tree.pair_pos = res_buff + result_size;
- cudaMemcpy(res_buff + result_size + 2, dev_pair_pos, ...);
+ cudaMemcpy(res_buff + result_size, dev_pair_pos, result_size * sizeof(int32_t), ...);

Defensive hardening: zero-initialize the buffer after allocation (memset(res_buff, 0, total_size * sizeof(int32_t));) so any residual gap cannot leak.

Attachments

The following files from this finding are attached:

  • llm_harness_parse_standard_json.cu
  • harness_afl_main.cpp
  • trigger_input.json
  • initcheck.log

Archive.zip

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions