feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648
feat(cuda): R4 DEEP composition + FRI commit phase on GPU#648ColoCarletti wants to merge 59 commits into
Conversation
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
# Conflicts: # crypto/math-cuda/build.rs # crypto/math-cuda/src/device.rs # crypto/math-cuda/src/lib.rs # crypto/stark/src/gpu_lde.rs # crypto/stark/src/prover.rs # prover/tests/cuda_path_integration.rs
|
/claude |
|
/codex |
Codex Code ReviewFound one issue: Medium Security / Potential Bug: crypto/math-cuda/src/deep.rs:123
Add a checked precondition before launching, e.g. ensure No other concrete security, correctness, or significant performance issues found in the PR diff. |
Review: R4 DEEP composition + FRI commit phase on GPUOverviewExtends the GPU proving pipeline through Round 4: DEEP composition polynomial evaluations and the full FRI commit phase now run device-side. CPU fallback is preserved throughout via Issues FoundMedium – Silent wrong result on zero input ( Low – Dead variable in Low – Panic contract should be documented ( Low – Code duplication in |
| // update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller | ||
| // must ensure the kernel is not reading the same index concurrently. Since | ||
| // we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing. | ||
| extern "C" __global__ void fri_update_twiddles( | ||
| uint64_t *tw, | ||
| uint64_t n_out) { | ||
| uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; | ||
| if (j >= n_out) return; | ||
| uint64_t old = tw[2 * j]; | ||
| tw[j] = goldilocks::mul(old, old); | ||
| } |
There was a problem hiding this comment.
Regarding the sentence in the doc comment:
Caller must ensure the kernel is not reading the same index concurrently.
I think there is no guarantee that the following two things cannot happen concurrently:
- Thread
jreadstw[2 * j] - Thread
k = 2 * jwritestw[k]
To make sure there is no race condition, we have to separate the buffer in two: input and output.
| // update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. Caller | |
| // must ensure the kernel is not reading the same index concurrently. Since | |
| // we read `old[2j]` and write `new[j]` with j < 2j, there's no aliasing. | |
| extern "C" __global__ void fri_update_twiddles( | |
| uint64_t *tw, | |
| uint64_t n_out) { | |
| uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; | |
| if (j >= n_out) return; | |
| uint64_t old = tw[2 * j]; | |
| tw[j] = goldilocks::mul(old, old); | |
| } | |
| // update_twiddles_in_place: new[j] = old[2j]^2. Writes in-place. | |
| extern "C" __global__ void fri_update_twiddles( | |
| const uint64_t *tw_in, uint64_t *tw_out, uint64_t n_out) { | |
| uint64_t j = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; | |
| if (j >= n_out) return; | |
| uint64_t old = tw_in[2 * j]; | |
| tw_out[j] = goldilocks::mul(old, old); | |
| } |
| for _ in 0..num_committed_layers { | ||
| // <<<< Receive challenge zeta_k | ||
| let zeta: FieldElement<E> = transcript.sample_field_element(); | ||
| // SAFETY: E == Ext3. | ||
| let zeta_ptr = &zeta as *const FieldElement<E> as *const u64; | ||
| let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] }; | ||
|
|
||
| let (root, layer_evals_u64, nodes_bytes) = state | ||
| .fold_and_commit_layer(zeta_raw) | ||
| .expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)"); |
There was a problem hiding this comment.
Instead of turning a CUDA error into a panic, we can fallback to CPU like we handled the other CUDA errors. We would have to add a snapshot of the transcript to avoid falling back to CPU path with a corrupt state:
| for _ in 0..num_committed_layers { | |
| // <<<< Receive challenge zeta_k | |
| let zeta: FieldElement<E> = transcript.sample_field_element(); | |
| // SAFETY: E == Ext3. | |
| let zeta_ptr = &zeta as *const FieldElement<E> as *const u64; | |
| let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] }; | |
| let (root, layer_evals_u64, nodes_bytes) = state | |
| .fold_and_commit_layer(zeta_raw) | |
| .expect("FRI commit: GPU fold+tree must not fail mid-phase (transcript advanced)"); | |
| let snapshot = transcript.clone(); | |
| for _ in 0..num_committed_layers { | |
| // <<<< Receive challenge zeta_k | |
| let zeta: FieldElement<E> = transcript.sample_field_element(); | |
| // SAFETY: E == Ext3. | |
| let zeta_ptr = &zeta as *const FieldElement<E> as *const u64; | |
| let zeta_raw: [u64; 3] = unsafe { [*zeta_ptr, *zeta_ptr.add(1), *zeta_ptr.add(2)] }; | |
| let (root, layer_evals_u64, nodes_bytes) = match state.fold_and_commit_layer(zeta_raw) { | |
| Ok(v) => v, | |
| Err(_) => { *transcript = snapshot; return None; } | |
| }; |
| let last_raw = state | ||
| .fold_final(zeta_raw) | ||
| .expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)"); |
There was a problem hiding this comment.
Following the same lines of https://github.com/yetanotherco/lambda_vm/pull/648/changes#r3365293666:
| let last_raw = state | |
| .fold_final(zeta_raw) | |
| .expect("FRI commit: GPU final fold must not fail mid-phase (transcript advanced)"); | |
| let last_raw = match state.fold_final(zeta_raw) { | |
| Ok(v) => v, | |
| Err(_) => { | |
| *transcript = snapshot; | |
| return None; | |
| } | |
| }; |
Summary
Extends the GPU-resident proving pipeline through Round 4. R4 DEEP composition and the full FRI commit phase (fold + per-layer Keccak leaves + pair-hash Merkle
tree) now run device-side, with only per-layer roots D2H'd for the transcript. The R2 composition-parts LDE moves to a
_keepvariant so its de-interleaved devicebuffer is retained on
Round2and reused by R4 DEEP without a re-H2D. Also lands a Blelloch chunk-scan parallel batch-inverse kernel as infrastructure for futureGPU-side denominator inversion (not yet wired).
Changes
crypto/math-cuda/kernels/{inverse,deep,fri}.cu— new kernels.crypto/math-cuda/src/{inverse,deep,fri}.rs— host orchestrators includingFriCommitState(ping-pong eval buffers, in-place inv_twiddles squaring, per-layerfused fold + leaves + tree).
crypto/stark/src/gpu_lde.rs— new dispatches:try_evaluate_parts_on_lde_gpu_keep,try_deep_composition_gpu,try_fri_commit_gpu. New counters:gpu_deep_calls,gpu_fri_calls.crypto/stark/src/prover.rs—Round2.gpu_composition_partsholds the R2 keep handle; R4 DEEP fast path insidecompute_deep_composition_poly_evaluationsconsumes R1 main/aux + R2 parts handles when available.
crypto/stark/src/fri/mod.rs—commit_phase_from_evaluationsroutes throughtry_fri_commit_gpuwhen cuda is enabled.cuda_path_integrationasserts the two new counters fireend-to-end.
Fallback
Every dispatch is gated by TypeId checks (Goldilocks + ext3) and the LDE-size threshold. Below threshold or on any cudarc error, the dispatch returns
Noneandthe existing CPU implementation runs unchanged. Exception: mid-FRI-loop cudarc failure panics, because the transcript is already advanced and a CPU restart would
re-sample
zeta_0against mutated state.Test plan
cargo test -p math-cuda --release --tests(GPU host) — 67 testscargo test -p stark --release --features cuda— 128 testscargo test -p stark --release(no cuda) — 128 testscargo test -p lambda-vm-prover --release(no cuda) — 384 testscargo test -p lambda-vm-prover --release --features cuda --test cuda_path_integration -- --ignored— all 6 counters fire end-to-endcargo clippy --workspace --all-targets --features cuda -- -D warnings— cleancargo clippy --workspace --all-targets -- -D warnings— cleancargo fmt --all --check— clean