Fast, zero-dependency inference engine for Gemma 4 in pure Java.
- Single file, no dependencies
- GGUF format parser
- Gemma 4 tokenizer
- Supports all Gemma 4 model families: E2B, E4B, 31B, and 26B-A4B (MoE)
- Mixture of Experts (MoE) routing and execution
- Sliding Window Attention (SWA) and full-attention layers
- Per-layer KV cache sharing and per-head Q/K RMS normalization
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
- Thinking mode control with `--think off|on|inline`
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--instruct` modes
- GraalVM Native Image support
- AOT model preloading for lower time-to-first-token
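As a rough illustration of the per-head Q/K RMS normalization mentioned above, here is a minimal sketch (the class and method names are hypothetical, not the actual Gemma4.java API). Each attention head's slice of the query/key vector is normalized independently, then scaled by a learned weight shared across heads:

```java
// Sketch of per-head RMS normalization (hypothetical names, not the
// actual Gemma4.java API). Note: some Gemma variants scale by
// (1 + weight); plain weight is used here for simplicity.
public final class PerHeadRmsNorm {
    // Normalizes x in place, one head (slice of headDim floats) at a time.
    static void rmsNormPerHead(float[] x, float[] weight, int nHeads, int headDim, float eps) {
        for (int h = 0; h < nHeads; h++) {
            int off = h * headDim;
            float sumSq = 0f;
            for (int i = 0; i < headDim; i++) {
                sumSq += x[off + i] * x[off + i];
            }
            float inv = (float) (1.0 / Math.sqrt(sumSq / headDim + eps));
            for (int i = 0; i < headDim; i++) {
                // weight has headDim entries, shared across all heads
                x[off + i] = x[off + i] * inv * weight[i];
            }
        }
    }
}
```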
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|---|---|---|
| E2B | Dense, ~5B total params | unsloth/gemma-4-E2B-it-GGUF |
| E4B | Dense, ~8B total params | unsloth/gemma-4-E4B-it-GGUF |
| 31B | Dense | unsloth/gemma-4-31B-it-GGUF |
| 26B-A4B | Mixture of Experts (MoE) | unsloth/gemma-4-26B-A4B-it-GGUF |
Q4_0 files are often mixed-quant in practice (for example, `token_embd.weight` and `output.weight` may use Q6_K).
A pure quantization is not required, but one can be generated from an F32/F16/BF16 GGUF source with `llama-quantize` from llama.cpp:

```shell
./llama-quantize --pure ./gemma-4-E2B-it-BF16.gguf ./gemma-4-E2B-it-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
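For reference, a Q4_0 block packs 32 weights into 18 bytes: a little-endian fp16 scale followed by 16 bytes of 4-bit quants, dequantized as `scale * (q - 8)`. A minimal block decoder might look like this (an illustrative sketch, not the actual Gemma4.java kernel):

```java
// Sketch of Q4_0 block dequantization (illustrative, not the actual
// Gemma4.java code). Low nibbles hold elements 0..15, high nibbles
// hold elements 16..31.
public final class Q40Block {
    static final int BLOCK_SIZE = 32;

    // block: 18 raw bytes; out: 32 dequantized floats.
    static void dequantize(byte[] block, float[] out) {
        // fp16 scale, little-endian (Float.float16ToFloat is Java 20+)
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits);
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            out[i]      = scale * ((b & 0x0F) - 8); // low nibble  -> first half
            out[i + 16] = scale * ((b >>> 4) - 8);  // high nibble -> second half
        }
    }
}
```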
Java 21+ is required, in particular for the `MemorySegment` mmap feature.
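The mmap path relies on the Java 21 FFM API, roughly along these lines (a simplified sketch, not the actual Gemma4.java code; the class name is made up for illustration):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: mmap a GGUF file into a MemorySegment (Java 21+ FFM API).
// GGUF files start with the little-endian magic 0x46554747 ("GGUF").
public final class GgufMmap {
    static MemorySegment map(Path gguf, Arena arena) throws IOException {
        try (FileChannel ch = FileChannel.open(gguf, StandardOpenOption.READ)) {
            // Maps the whole file read-only; the segment stays valid
            // for as long as the arena is open.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: GgufMmap <file.gguf>");
            return;
        }
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = map(Path.of(args[0]), arena);
            int magic = seg.get(ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN), 0);
            System.out.println(magic == 0x46554747 ? "GGUF" : "not a GGUF file");
        }
    }
}
```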
jbang is a good fit for this use case:

```shell
jbang Gemma4.java --help
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --chat
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):

```shell
chmod +x Gemma4.java
./Gemma4.java --help
```

A simple Makefile is provided. Run `make jar` to produce `gemma4.jar`, then run it as follows:

```shell
java --enable-preview --add-modules jdk.incubator.vector -jar gemma4.jar --help
```

Compile with `make native` to produce a `gemma4` executable, then:

```shell
./gemma4 --model ./gemma-4-E2B-it-Q4_0.gguf --chat
```

Gemma4.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT). To AOT pre-load a GGUF model:

```shell
PRELOAD_GGUF=/path/to/model.gguf make native
```

This generates a larger, specialized binary with parse overhead removed for that specific model; it can still run other models with normal parsing behavior.
GraalVM 25+ is recommended for the best JIT performance; it provides partial but good support for the Vector API.
By default, the preferred vector size is used; it can be forced with `-Dllama.VectorBitSize=0|128|256|512`, where 0 means disabled.
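Vector API kernels typically follow a standard pattern: loop over the preferred species width, then handle the remainder with a scalar tail. A hedged sketch of a dot-product kernel in that style (illustrative, not the actual Gemma4.java matrix-vector kernel):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch of a dot-product kernel using the platform's preferred vector
// species (illustrative; not the actual Gemma4.java kernel).
// Compile/run with: --add-modules jdk.incubator.vector
public final class DotKernel {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```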
Apache 2.0
