Description
Environment
- Platform: Apple Silicon Mac
- Host: Apple M4 Max
- OS: macOS
- Compiler: Homebrew clang 18.1.8
- BitNet / submodule state: BitNet using vendored 3rdparty/llama.cpp at Eddie-Wang1120/llama.cpp commit 1f86f058de0c3f4098dedae2ae8653c335c868a1
- Model: microsoft/BitNet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf
- Build flags: GGML_METAL=ON GGML_ACCELERATE=ON GGML_BLAS=ON GGML_BLAS_VENDOR=Apple BITNET_ARM_TL1=OFF
Problem
On Apple Silicon with Metal enabled, i2_s inference can segfault when BLAS is enabled and the physical micro-batch crosses the BLAS routing threshold.
The crash is tied to physical ubatch, not logical batch:
- -b 2048 -ub 31 -> stable
- -b 32 -ub 31 -> stable
- -b 2048 -ub 32 -> segfault
- -b 2048 -ub 512 -> segfault
This means the failure starts exactly when the BLAS backend begins claiming the generic MUL_MAT path for larger batches.
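The four observations above are consistent with a size check that looks only at the number of tokens in one physical micro-batch. The following is a minimal sketch of that reasoning, not the actual ggml code: `kMinBatch`, `blas_size_check`, and `max_tokens_per_matmul` are hypothetical names, and the threshold value of 32 is assumed from the observed behavior.

```cpp
#include <cassert>
#include <cstdint>

// Assumed BLAS routing threshold; the report only shows that behavior
// changes between ubatch 31 and ubatch 32.
const int64_t kMinBatch = 32;

// Hypothetical stand-in for the size part of the BLAS MUL_MAT support check.
// ne0  = output rows, ne1 = tokens in the physical micro-batch,
// ne10 = shared inner dimension.
inline bool blas_size_check(int64_t ne0, int64_t ne1, int64_t ne10) {
    return ne0 >= kMinBatch && ne1 >= kMinBatch && ne10 >= kMinBatch;
}

// Upper bound on the tokens a single MUL_MAT can see: the logical batch is
// processed in chunks of at most `ubatch` tokens.
inline int64_t max_tokens_per_matmul(int64_t batch, int64_t ubatch) {
    assert(batch > 0 && ubatch > 0);
    return batch < ubatch ? batch : ubatch;
}
```

With illustrative weight dimensions of 1024, `blas_size_check(1024, max_tokens_per_matmul(2048, 31), 1024)` is false while the same call with `-ub 32` is true, which matches the stable/segfault split regardless of the logical batch size.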
Control Experiment
The same Metal runtime is stable when BLAS is disabled:
- BLAS ON + -b 2048 -ub 512 -> segfault
- BLAS OFF + -b 2048 -ub 512 -> stable
This strongly suggests the crash is in the BLAS-side handling of GGML_TYPE_I2_S, not in Metal itself and not in the outer chat request schema.
Root Cause
ggml-blas.cpp allows the generic BLAS MUL_MAT path to accept quantized source tensors when ggml_get_type_traits(src0->type)->to_float != NULL.
For GGML_TYPE_I2_S, that is not safe:
- I2_S stores an external scale outside the per-row payload
- the generic BLAS dequantize-to-float path assumes self-contained per-row data
- once ubatch >= 32, BLAS starts claiming MUL_MAT
- that eventually crashes in the i2_s dequant / BLAS matmul path
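To illustrate the information gap behind the first two points, here is a toy sketch. It is not the real I2_S layout or the real ggml callback: `ToyTensor`, `toy_pack`, `toy_row_to_float`, and `toy_row_to_float_scaled` are hypothetical, and the format (2-bit ternary codes with a single scale stored outside the rows) is a simplification chosen for illustration. It shows only that a per-row callback receiving a row pointer, an output pointer, and a count has no way to obtain an external scale; it does not reproduce the segfault.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy layout, NOT the real I2_S format: ternary weights packed four per byte
// (2 bits each, codes 0,1,2 -> -1,0,+1) plus one scale kept outside the rows.
struct ToyTensor {
    std::vector<uint8_t> packed;  // rows * cols / 4 bytes
    float scale;                  // external to every row
    size_t rows, cols;
};

ToyTensor toy_pack(const std::vector<int>& w, size_t rows, size_t cols, float scale) {
    assert(w.size() == rows * cols && cols % 4 == 0);
    ToyTensor t{std::vector<uint8_t>(w.size() / 4, 0), scale, rows, cols};
    for (size_t i = 0; i < w.size(); ++i) {
        t.packed[i / 4] |= static_cast<uint8_t>((w[i] + 1) << (2 * (i % 4)));
    }
    return t;
}

// Generic per-row shape: only row bytes and a count are visible, so the
// result is necessarily unscaled.
void toy_row_to_float(const uint8_t* row, float* out, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; ++i) {
        unsigned code = (row[i / 4] >> (2 * (i % 4))) & 0x3u;
        out[i] = static_cast<float>(static_cast<int>(code) - 1);
    }
}

// Tensor-aware variant: the caller passes the external scale explicitly.
void toy_row_to_float_scaled(const uint8_t* row, float* out, size_t n, float scale) {
    toy_row_to_float(row, out, n);
    for (size_t i = 0; i < n; ++i) out[i] *= scale;
}
```

In this toy the generic path merely returns unscaled values. A real implementation that tries to recover the scale from memory relative to the row pointer could instead read out of bounds, which would be consistent with the crash frames reported below, though that link is an inference rather than something this sketch demonstrates.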
In crash reports, the top frames consistently land in:
- dequantize_row_i2_s
- ggml_backend_blas_mul_mat
Proposed Fix
Reject GGML_TYPE_I2_S in the generic BLAS MUL_MAT support check so that I2_S continues using its specialized non-BLAS path:
    return src0->type != GGML_TYPE_I2_S &&
           ggml_is_contiguous(src0) &&
           ggml_is_contiguous(src1) &&
           src1->type == GGML_TYPE_F32 &&
           (ne0 >= min_batch && ne1 >= min_batch && ne10 >= min_batch) &&
           (src0->type == GGML_TYPE_F32 || ggml_get_type_traits(src0->type)->to_float != NULL);

Result After Patch
After applying the BLAS guard above:
- BLAS ON + Metal + -b 2048 -ub 512 is stable
- managed broker end-to-end requests no longer segfault under the same settings
This does not solve all i2_s quality issues, but it does remove the native crash path.
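The effect of the guard can be modeled in isolation. The sketch below mirrors the structure of the proposed return expression using hypothetical stand-in types (`ToyType`, `ToyTensorInfo`, `toy_blas_supports_mul_mat`); it is not the real ggml API, and the `min_batch` value of 32 is assumed from the observed threshold.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the ggml types involved; not the real definitions.
enum class ToyType { F32, Q4_0, I2_S };

struct ToyTensorInfo {
    ToyType type;
    bool contiguous;
    bool has_to_float;  // mirrors "to_float != NULL" in the type traits
};

// Same shape as the proposed check: the I2_S rejection comes first, so the
// type never reaches the generic dequantize-to-float path.
bool toy_blas_supports_mul_mat(const ToyTensorInfo& src0, const ToyTensorInfo& src1,
                               int64_t ne0, int64_t ne1, int64_t ne10) {
    const int64_t min_batch = 32;  // assumed threshold
    assert(ne0 > 0 && ne1 > 0 && ne10 > 0);
    return src0.type != ToyType::I2_S &&
           src0.contiguous &&
           src1.contiguous &&
           src1.type == ToyType::F32 &&
           (ne0 >= min_batch && ne1 >= min_batch && ne10 >= min_batch) &&
           (src0.type == ToyType::F32 || src0.has_to_float);
}
```

In this model an I2_S weight tensor is rejected at any batch size even though it advertises a to_float callback, while other quantized types with a callback are still accepted once all three dimensions reach the threshold. The guard is therefore narrow: it changes routing for I2_S only.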
Related Issues
- UB in to_float callback for I2_S (missing scale param) + 3 more bugs on ARM Ampere Altra #468
- ARM I2_S inference produces gibberish/garbage output after commit 112f853 (CPU Optimization update) #470
- Model only outputs G repeatedly in interactive mode with ggml-model-i2_s.gguf #195
- Garbage output on ARMv8.0 (Cortex-A53/A73) — NEON-only fallback path produces incorrect results #411