feat(cuda): R4 DEEP composition + FRI commit phase on GPU by ColoCarletti · Pull Request #648 · yetanotherco/lambda_vm

ColoCarletti · 2026-06-03T18:33:01Z

Summary

Extends the GPU-resident proving pipeline through Round 4. R4 DEEP composition and the full FRI commit phase (fold + per-layer Keccak leaves + pair-hash Merkle
tree) now run device-side, with only per-layer roots D2H'd for the transcript. The R2 composition-parts LDE moves to a _keep variant so its de-interleaved device
buffer is retained on Round2 and reused by R4 DEEP without a re-H2D. Also lands a Blelloch chunk-scan parallel batch-inverse kernel as infrastructure for future
GPU-side denominator inversion (not yet wired).

Changes

crypto/math-cuda/kernels/{inverse,deep,fri}.cu — new kernels.
crypto/math-cuda/src/{inverse,deep,fri}.rs — host orchestrators including FriCommitState (ping-pong eval buffers, in-place inv_twiddles squaring, per-layer
fused fold + leaves + tree).
crypto/stark/src/gpu_lde.rs — new dispatches: try_evaluate_parts_on_lde_gpu_keep, try_deep_composition_gpu, try_fri_commit_gpu. New counters:
gpu_deep_calls, gpu_fri_calls.
crypto/stark/src/prover.rs — Round2.gpu_composition_parts holds the R2 keep handle; R4 DEEP fast path inside compute_deep_composition_poly_evaluations
consumes R1 main/aux + R2 parts handles when available.
crypto/stark/src/fri/mod.rs — commit_phase_from_evaluations routes through try_fri_commit_gpu when cuda is enabled.
Tests: parity for batch invert (n in {2..2^20}), DEEP, FRI per-layer tree (log_num_leaves in {1..18}); cuda_path_integration asserts the two new counters fire
end-to-end.

Fallback

Every dispatch is gated by TypeId checks (Goldilocks + ext3) and the LDE-size threshold. Below threshold or on any cudarc error, the dispatch returns None and
the existing CPU implementation runs unchanged. Exception: mid-FRI-loop cudarc failure panics, because the transcript is already advanced and a CPU restart would
re-sample zeta_0 against mutated state.

Test plan

cargo test -p math-cuda --release --tests (GPU host) — 67 tests
cargo test -p stark --release --features cuda — 128 tests
cargo test -p stark --release (no cuda) — 128 tests
cargo test -p lambda-vm-prover --release (no cuda) — 384 tests
cargo test -p lambda-vm-prover --release --features cuda --test cuda_path_integration -- --ignored — all 6 counters fire end-to-end
cargo clippy --workspace --all-targets --features cuda -- -D warnings — clean
cargo clippy --workspace --all-targets -- -D warnings — clean
cargo fmt --all --check — clean

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

…commits

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

…o feat/cuda-pr3

# Conflicts: # crypto/math-cuda/build.rs # crypto/math-cuda/src/device.rs # crypto/math-cuda/src/lib.rs # crypto/stark/src/gpu_lde.rs # crypto/stark/src/prover.rs # prover/tests/cuda_path_integration.rs

ColoCarletti · 2026-06-05T14:05:10Z

/claude

ColoCarletti · 2026-06-05T14:05:36Z

/codex

github-actions · 2026-06-05T14:09:01Z

Codex Code Review

Found one issue:

Medium Security / Potential Bug: crypto/math-cuda/src/deep.rs:123

deep_composition_ext3_impl validates buffer lengths against main_lde.lde_size, but it never validates that the kernel row index stays within that stride. The CUDA kernel computes row = i * blowup_factor and then reads main_lde[...] + row, aux_lde[...] + row, and h_lde[...] + row at crypto/math-cuda/kernels/deep.cu:48. A safe Rust caller can pass domain_size * blowup_factor > main_lde.lde_size, causing out-of-bounds device reads.

Add a checked precondition before launching, e.g. ensure domain_size == 0 || (domain_size - 1) * blowup_factor < main_lde.lde_size, using checked arithmetic and returning an error instead of relying on CUDA memory behavior.

No other concrete security, correctness, or significant performance issues found in the PR diff.

claude · 2026-06-05T14:15:25Z

Review: R4 DEEP composition + FRI commit phase on GPU

Overview

Extends the GPU proving pipeline through Round 4: DEEP composition polynomial evaluations and the full FRI commit phase now run device-side. CPU fallback is preserved throughout via None-returning dispatches, with one deliberate exception for mid-FRI CUDA failures (acknowledged in the PR description). Architecture is clean — the ping-pong buffer design in FriCommitState, the chunk-scan batch-inverse, and the R2 _keep handle reuse are all well-structured. Test coverage (parity tests for batch-inverse, DEEP, FRI per-layer tree, plus end-to-end counter assertions) is good.

Issues Found

Medium – Silent wrong result on zero input (inverse.rs:408)
gl_inv(0) returns 0 via Fermat (0^(p-2) = 0). If norm == 0 (which can only happen if a batch input element is zero), invert_ext3_host silently returns [0,0,0] instead of failing. All batch-inverse outputs then become 0, corrupting the proof without any visible error. Callers today never pass zero elements, but there is no enforcement — add assert!(norm != 0) before the inversion.

Low – Dead variable in deep.cu (deep.cu:68)
num_total_cols is computed but never read — only voided to suppress the compiler warning. Remove it.

Low – Panic contract should be documented (gpu_lde.rs)
try_fri_commit_gpu panics on any CUDA error after the first transcript mutation. The design is correct (a CPU retry would re-sample zeta_0 against an advanced transcript), but the function's doc comment doesn't call this out. Add a # Panics section so future callers are not surprised.

Low – Code duplication in inverse.rs (inverse.rs:178)
batch_inverse_ext3 and batch_inverse_ext3_dev share ~90% of their bodies (all six kernel dispatch phases). The only difference is the initial H2D. Extract a private helper taking a &CudaSlice<u64> to avoid duplicating the logic.

gabrielbosio · 2026-06-05T19:13:52Z

+// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller
+// must ensure the kernel is not reading the same index concurrently. Since
+// we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing.
+extern "C" __global__ void fri_update_twiddles(
+    uint64_t *tw,
+    uint64_t n_out) {
+    uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
+    if (j >= n_out) return;
+    uint64_t old = tw[2 * j];
+    tw[j] = goldilocks::mul(old, old);
+}


Regarding the sentence in the doc comment:

Caller must ensure the kernel is not reading the same index concurrently.

I think there is no guarantee that the following two things cannot happen concurrently:

Thread j reads tw[2 * j]

Thread k = 2 * j writes tw[k]

To make sure there is no race condition, we have to separate the buffer in two: input and output.

Suggested change

// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller

// must ensure the kernel is not reading the same index concurrently. Since

// we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing.

extern "C" __global__ void fri_update_twiddles(

uint64_t *tw,

uint64_t n_out) {

uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;

if (j >= n_out) return;

uint64_t old = tw[2 * j];

tw[j] = goldilocks::mul(old, old);

}

// update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place.

extern "C" __global__ void fri_update_twiddles(

const uint64_t *tw_in, uint64_t *tw_out, uint64_t n_out) {

uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;

if (j >= n_out) return;

uint64_t old = tw_in[2 * j];

tw_out[j] = goldilocks::mul(old, old);

}

gabrielbosio · 2026-06-05T20:45:23Z

+    for _ in 0..num_committed_layers {
+        // <<<< Receive challenge zeta_k
+        let zeta: FieldElement<E> = transcript.sample_field_element();
+        // SAFETY: E == Ext3.
+        let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;
+        let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };
+
+        let (root, layer_evals_u64, nodes_bytes) = state
+            .fold_and_commit_layer(zeta_raw)
+            .expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)");


Instead of turning a CUDA error into a panic, we can fallback to CPU like we handled the other CUDA errors. We would have to add a snapshot of the transcript to avoid falling back to CPU path with a corrupt state:

Suggested change

for _ in 0..num_committed_layers {

// <<<< Receive challenge zeta_k

let zeta: FieldElement<E> = transcript.sample_field_element();

// SAFETY: E == Ext3.

let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;

let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };

let (root, layer_evals_u64, nodes_bytes) = state

.fold_and_commit_layer(zeta_raw)

.expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)");

let snapshot = transcript.clone();

for _ in 0..num_committed_layers {

// <<<< Receive challenge zeta_k

let zeta: FieldElement<E> = transcript.sample_field_element();

// SAFETY: E == Ext3.

let zeta_ptr = &zeta as *const FieldElement<E> as *const u64;

let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] };

let (root, layer_evals_u64, nodes_bytes) = match state.fold_and_commit_layer(zeta_raw) {

Ok(v) => v,

Err(_) => { *transcript = snapshot; return None; }

};

gabrielbosio · 2026-06-05T21:22:41Z

+    let last_raw = state
+        .fold_final(zeta_raw)
+        .expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)");


Following the same lines of https://github.com/yetanotherco/lambda_vm/pull/648/changes#r3365293666:

Suggested change

let last_raw = state

.fold_final(zeta_raw)

.expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)");

let last_raw = match state.fold_final(zeta_raw) {

Ok(v) => v,

Err(_) => {

*transcript = snapshot;

return None;

}

};

ColoCarletti and others added 30 commits May 6, 2026 15:12

add first cuda files

d1a0abf

fmt

79634ff

fix clippy

ac6fbb5

gpu 2nd part

2ceb3b0

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles

affceb1

merge main

01172f2

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

c4627e1

comments fix

01aa5e4

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

cfc5c19

Update crypto/stark/src/gpu_lde.rs

ea5696f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a8cf265

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

fb8d31f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a79f2b5

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

761a2c0

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

address reviews

e066e9d

fix review comments

7d3d0f0

Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-…

cf80771

…commits

address doc comment suggestions

71aba0d

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

83d91b8

fix

34cae4b

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

f076bf4

Pass replay transcript to bus-balance call in verify_vm_minimal

a2cde0f

Update crypto/math-cuda/src/device.rs

46c305b

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

aca3dca

Update crypto/math-cuda/src/device.rs

63d7c00

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

eb16c02

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

66925b1

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4e6daf3

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4cd27d9

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

5fe390f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

ColoCarletti and others added 16 commits June 1, 2026 17:24

Update crypto/math-cuda/src/barycentric.rs

1f9394d

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/barycentric.rs

b07999c

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

fix imports

c575017

Merge branch 'feat/cuda-pr3' of github.com:yetanotherco/lambda_vm int…

0ffc661

…o feat/cuda-pr3

Merge branch 'main' into feat/cuda-pr3

0777f1e

cuda integration tests

2c7b0de

address review feedback

2f1fe2d

batch invert kernels and parity test

f254eae

DEEP composition kernel

84cc04b

fri

0ba7745

gpu lde

7046a40

gpu_lde

065c8f9

fri

7d2810f

add tests

cc840cd

fix

fac3974

Merge branch 'main' into feat/cuda-pr4

bc61a00

# Conflicts: # crypto/math-cuda/build.rs # crypto/math-cuda/src/device.rs # crypto/math-cuda/src/lib.rs # crypto/stark/src/gpu_lde.rs # crypto/stark/src/prover.rs # prover/tests/cuda_path_integration.rs

gabrielbosio added the gpu Related to GPU/CUDA development label Jun 4, 2026

claude Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crypto/math-cuda/kernels/deep.cu Outdated

claude Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/inverse.rs

claude Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crypto/math-cuda/src/inverse.rs

claude Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

ColoCarletti added 2 commits June 5, 2026 11:41

fix comments

3c52fdf

add integration tests

59437f3

gabrielbosio reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648

feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648
ColoCarletti wants to merge 59 commits into
mainfrom
feat/cuda-pr4

ColoCarletti commented Jun 3, 2026

Uh oh!

ColoCarletti commented Jun 5, 2026

Uh oh!

ColoCarletti commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

gabrielbosio Jun 5, 2026

Uh oh!

gabrielbosio Jun 5, 2026

Uh oh!

gabrielbosio Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-    let last_raw = state
-        .fold_final(zeta_raw)
-        .expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)");
+    let last_raw = match state.fold_final(zeta_raw) {
+        Ok(v) => v,
+        Err(_) => {
+            *transcript = snapshot;
+            return None;
+        }
+    };

Conversation

ColoCarletti commented Jun 3, 2026

Summary

Changes

Fallback

Test plan

Uh oh!

ColoCarletti commented Jun 5, 2026

Uh oh!

ColoCarletti commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 5, 2026

Review: R4 DEEP composition + FRI commit phase on GPU

Overview

Issues Found

Uh oh!

gabrielbosio Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gabrielbosio Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gabrielbosio Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants