Skip to content

feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648

Open
ColoCarletti wants to merge 59 commits into
mainfrom
feat/cuda-pr4
Open

feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648
ColoCarletti wants to merge 59 commits into
mainfrom
feat/cuda-pr4

Conversation

@ColoCarletti
Copy link
Copy Markdown
Collaborator

Summary

Extends the GPU-resident proving pipeline through Round 4. R4 DEEP composition and the full FRI commit phase (fold + per-layer Keccak leaves + pair-hash Merkle
tree) now run device-side, with only per-layer roots D2H'd for the transcript. The R2 composition-parts LDE moves to a _keep variant so its de-interleaved device
buffer is retained on Round2 and reused by R4 DEEP without a re-H2D. Also lands a Blelloch chunk-scan parallel batch-inverse kernel as infrastructure for future
GPU-side denominator inversion (not yet wired).

Changes

  • crypto/math-cuda/kernels/{inverse,deep,fri}.cu — new kernels.
  • crypto/math-cuda/src/{inverse,deep,fri}.rs — host orchestrators including FriCommitState (ping-pong eval buffers, in-place inv_twiddles squaring, per-layer
    fused fold + leaves + tree).
  • crypto/stark/src/gpu_lde.rs — new dispatches: try_evaluate_parts_on_lde_gpu_keep, try_deep_composition_gpu, try_fri_commit_gpu. New counters:
    gpu_deep_calls, gpu_fri_calls.
  • crypto/stark/src/prover.rsRound2.gpu_composition_parts holds the R2 keep handle; R4 DEEP fast path inside compute_deep_composition_poly_evaluations
    consumes R1 main/aux + R2 parts handles when available.
  • crypto/stark/src/fri/mod.rscommit_phase_from_evaluations routes through try_fri_commit_gpu when cuda is enabled.
  • Tests: parity for batch invert (n in {2..2^20}), DEEP, FRI per-layer tree (log_num_leaves in {1..18}); cuda_path_integration asserts the two new counters fire
    end-to-end.

Fallback

Every dispatch is gated by TypeId checks (Goldilocks + ext3) and the LDE-size threshold. Below threshold or on any cudarc error, the dispatch returns None and
the existing CPU implementation runs unchanged. Exception: mid-FRI-loop cudarc failure panics, because the transcript is already advanced and a CPU restart would
re-sample zeta_0 against mutated state.

Test plan

  • cargo test -p math-cuda --release --tests (GPU host) — 67 tests
  • cargo test -p stark --release --features cuda — 128 tests
  • cargo test -p stark --release (no cuda) — 128 tests
  • cargo test -p lambda-vm-prover --release (no cuda) — 384 tests
  • cargo test -p lambda-vm-prover --release --features cuda --test cuda_path_integration -- --ignored — all 6 counters fire end-to-end
  • cargo clippy --workspace --all-targets --features cuda -- -D warnings — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --all --check — clean

ColoCarletti and others added 30 commits May 6, 2026 15:12
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
ColoCarletti and others added 16 commits June 1, 2026 17:24
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
# Conflicts:
#	crypto/math-cuda/build.rs
#	crypto/math-cuda/src/device.rs
#	crypto/math-cuda/src/lib.rs
#	crypto/stark/src/gpu_lde.rs
#	crypto/stark/src/prover.rs
#	prover/tests/cuda_path_integration.rs
@gabrielbosio gabrielbosio added the gpu Related to GPU/CUDA development label Jun 4, 2026
@ColoCarletti
Copy link
Copy Markdown
Collaborator Author

/claude

@ColoCarletti
Copy link
Copy Markdown
Collaborator Author

/codex

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

Codex Code Review

Found one issue:

Medium Security / Potential Bug: crypto/math-cuda/src/deep.rs:123

deep_composition_ext3_impl validates buffer lengths against main_lde.lde_size, but it never validates that the kernel row index stays within that stride. The CUDA kernel computes row = i * blowup_factor and then reads main_lde[...] + row, aux_lde[...] + row, and h_lde[...] + row at crypto/math-cuda/kernels/deep.cu:48. A safe Rust caller can pass domain_size * blowup_factor > main_lde.lde_size, causing out-of-bounds device reads.

Add a checked precondition before launching, e.g. ensure domain_size == 0 || (domain_size - 1) * blowup_factor < main_lde.lde_size, using checked arithmetic and returning an error instead of relying on CUDA memory behavior.

No other concrete security, correctness, or significant performance issues found in the PR diff.

Comment thread crypto/math-cuda/kernels/deep.cu Outdated
Comment thread crypto/math-cuda/src/inverse.rs
Comment thread crypto/math-cuda/src/inverse.rs
Comment thread crypto/stark/src/gpu_lde.rs
@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Review: R4 DEEP composition + FRI commit phase on GPU

Overview

Extends the GPU proving pipeline through Round 4: DEEP composition polynomial evaluations and the full FRI commit phase now run device-side. CPU fallback is preserved throughout via None-returning dispatches, with one deliberate exception for mid-FRI CUDA failures (acknowledged in the PR description). Architecture is clean — the ping-pong buffer design in FriCommitState, the chunk-scan batch-inverse, and the R2 _keep handle reuse are all well-structured. Test coverage (parity tests for batch-inverse, DEEP, FRI per-layer tree, plus end-to-end counter assertions) is good.

Issues Found

Medium – Silent wrong result on zero input (inverse.rs:408)
gl_inv(0) returns 0 via Fermat (0^(p-2) = 0). If norm == 0 (which can only happen if a batch input element is zero), invert_ext3_host silently returns [0,0,0] instead of failing. All batch-inverse outputs then become 0, corrupting the proof without any visible error. Callers today never pass zero elements, but there is no enforcement — add assert!(norm != 0) before the inversion.

Low – Dead variable in deep.cu (deep.cu:68)
num_total_cols is computed but never read — only voided to suppress the compiler warning. Remove it.

Low – Panic contract should be documented (gpu_lde.rs)
try_fri_commit_gpu panics on any CUDA error after the first transcript mutation. The design is correct (a CPU retry would re-sample zeta_0 against an advanced transcript), but the function's doc comment doesn't call this out. Add a # Panics section so future callers are not surprised.

Low – Code duplication in inverse.rs (inverse.rs:178)
batch_inverse_ext3 and batch_inverse_ext3_dev share ~90% of their bodies (all six kernel dispatch phases). The only difference is the initial H2D. Extract a private helper taking a &CudaSlice<u64> to avoid duplicating the logic.

Comment on lines +49 to +59
// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller
// must ensure the kernel is not reading the same index concurrently. Since
// we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing.
extern "C" __global__ void fri_update_twiddles(
uint64_t *tw,
uint64_t n_out) {
uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
if (j >= n_out) return;
uint64_t old = tw[2 * j];
tw[j] = goldilocks::mul(old, old);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the sentence in the doc comment:

Caller must ensure the kernel is not reading the same index concurrently.

I think there is no guarantee that the following two things cannot happen concurrently:

  • Thread j reads tw[2 * j]
  • Thread k = 2 * j writes tw[k]

To make sure there is no race condition, we have to separate the buffer in two: input and output.

Suggested change
// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller
// must ensure the kernel is not reading the same index concurrently. Since
// we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing.
extern "C" __global__ void fri_update_twiddles(
uint64_t *tw,
uint64_t n_out) {
uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
if (j >= n_out) return;
uint64_t old = tw[2 * j];
tw[j] = goldilocks::mul(old, old);
}
// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place.
extern "C" __global__ void fri_update_twiddles(
const uint64_t *tw_in, uint64_t *tw_out, uint64_t n_out) {
uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
if (j >= n_out) return;
uint64_t old = tw_in[2 * j];
tw_out[j] = goldilocks::mul(old, old);
}

Comment on lines +1302 to +1311
for _ in 0..num_committed_layers {
// <<<< Receive challenge zeta_k
let zeta: FieldElement<E> = transcript.sample_field_element();
// SAFETY: E == Ext3.
let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;
let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };

let (root, layer_evals_u64, nodes_bytes) = state
.fold_and_commit_layer(zeta_raw)
.expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of turning a CUDA error into a panic, we can fallback to CPU like we handled the other CUDA errors. We would have to add a snapshot of the transcript to avoid falling back to CPU path with a corrupt state:

Suggested change
for _ in 0..num_committed_layers {
// <<<< Receive challenge zeta_k
let zeta: FieldElement<E> = transcript.sample_field_element();
// SAFETY: E == Ext3.
let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;
let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };
let (root, layer_evals_u64, nodes_bytes) = state
.fold_and_commit_layer(zeta_raw)
.expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)");
let snapshot = transcript.clone();
for _ in 0..num_committed_layers {
// <<<< Receive challenge zeta_k
let zeta: FieldElement<E> = transcript.sample_field_element();
// SAFETY: E == Ext3.
let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;
let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };
let (root, layer_evals_u64, nodes_bytes) = match state.fold_and_commit_layer(zeta_raw) {
Ok(v) => v,
Err(_) => { *transcript = snapshot; return None; }
};

Comment on lines +1337 to +1339
let last_raw = state
.fold_final(zeta_raw)
.expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the same lines of https://github.com/yetanotherco/lambda_vm/pull/648/changes#r3365293666:

Suggested change
let last_raw = state
.fold_final(zeta_raw)
.expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)");
let last_raw = match state.fold_final(zeta_raw) {
Ok(v) => v,
Err(_) => {
*transcript = snapshot;
return None;
}
};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpu Related to GPU/CUDA development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants