Fast, zero-dependency inference engine for Gemma 4 in pure Java.
- Single file, no dependencies
- GGUF format parser
- Gemma 4 tokenizer
- Supports all Gemma 4 model families: E2B, E4B, 31B, and 26B-A4B (MoE)
- Mixture of Experts (MoE) routing and execution
- Sliding Window Attention (SWA) and full-attention layers
- Per-layer KV cache sharing and per-head Q/K RMS normalization
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
- Thinking mode control with `--think off|on|inline`
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--instruct` modes
- GraalVM Native Image support
- AOT model preloading for lower time-to-first-token
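As a rough illustration of the per-head Q/K RMS normalization mentioned above, here is a minimal sketch (the class and method names are hypothetical, not the actual Gemma4.java API). Each attention head's slice of the query/key vector is normalized independently, then scaled by a learned weight shared across heads:

```java
// Sketch of per-head RMS normalization (hypothetical names, not the
// actual Gemma4.java API). Note: some Gemma variants scale by
// (1 + weight); plain weight is used here for simplicity.
public final class PerHeadRmsNorm {
    // Normalizes x in place, one head (slice of headDim floats) at a time.
    static void rmsNormPerHead(float[] x, float[] weight, int nHeads, int headDim, float eps) {
        for (int h = 0; h < nHeads; h++) {
            int off = h * headDim;
            float sumSq = 0f;
            for (int i = 0; i < headDim; i++) {
                sumSq += x[off + i] * x[off + i];
            }
            float inv = (float) (1.0 / Math.sqrt(sumSq / headDim + eps));
            for (int i = 0; i < headDim; i++) {
                // weight has headDim entries, shared across all heads
                x[off + i] = x[off + i] * inv * weight[i];
            }
        }
    }
}
```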
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|---|---|---|
| E2B | Dense, ~5B total params | unsloth/gemma-4-E2B-it-GGUF |
| E4B | Dense, ~8B total params | unsloth/gemma-4-E4B-it-GGUF |
| 31B | Dense | unsloth/gemma-4-31B-it-GGUF |
| 26B-A4B | Mixture of Experts (MoE) | unsloth/gemma-4-26B-A4B-it-GGUF |
Q4_0 files are often mixed-quant in practice (for example, `token_embd.weight` and `output.weight` may use Q6_K).
A pure quantization is not required, but one can be generated from an F32/F16/BF16 GGUF source with `llama-quantize` from llama.cpp:

```shell
./llama-quantize --pure ./gemma-4-E2B-it-BF16.gguf ./gemma-4-E2B-it-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
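For reference, a Q4_0 block packs 32 weights into 18 bytes: a little-endian fp16 scale followed by 16 bytes of 4-bit quants, dequantized as `scale * (q - 8)`. A minimal block decoder might look like this (an illustrative sketch, not the actual Gemma4.java kernel):

```java
// Sketch of Q4_0 block dequantization (illustrative, not the actual
// Gemma4.java code). Low nibbles hold elements 0..15, high nibbles
// hold elements 16..31.
public final class Q40Block {
    static final int BLOCK_SIZE = 32;

    // block: 18 raw bytes; out: 32 dequantized floats.
    static void dequantize(byte[] block, float[] out) {
        // fp16 scale, little-endian (Float.float16ToFloat is Java 20+)
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits);
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            out[i]      = scale * ((b & 0x0F) - 8); // low nibble  -> first half
            out[i + 16] = scale * ((b >>> 4) - 8);  // high nibble -> second half
        }
    }
}
```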
Java 21+ is required, in particular for the `MemorySegment` mmap feature.
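The mmap path relies on the Java 21 FFM API, roughly along these lines (a simplified sketch, not the actual Gemma4.java code; the class name is made up for illustration):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: mmap a GGUF file into a MemorySegment (Java 21+ FFM API).
// GGUF files start with the little-endian magic 0x46554747 ("GGUF").
public final class GgufMmap {
    static MemorySegment map(Path gguf, Arena arena) throws IOException {
        try (FileChannel ch = FileChannel.open(gguf, StandardOpenOption.READ)) {
            // Maps the whole file read-only; the segment stays valid
            // for as long as the arena is open.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: GgufMmap <file.gguf>");
            return;
        }
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = map(Path.of(args[0]), arena);
            int magic = seg.get(ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN), 0);
            System.out.println(magic == 0x46554747 ? "GGUF" : "not a GGUF file");
        }
    }
}
```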
jbang is a good fit for this use case:

```shell
jbang Gemma4.java --help
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --chat
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):

```shell
chmod +x Gemma4.java
./Gemma4.java --help
```

A simple Makefile is provided. Run `make jar` to produce `gemma4.jar`, then run it as follows:

```shell
java --enable-preview --add-modules jdk.incubator.vector -jar gemma4.jar --help
```

Compile with `make native` to produce a `gemma4` executable, then:

```shell
./gemma4 --model ./gemma-4-E2B-it-Q4_0.gguf --chat
```

Gemma4.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT). To AOT pre-load a GGUF model:

```shell
PRELOAD_GGUF=/path/to/model.gguf make native
```

This generates a larger, specialized binary with parse overhead removed for that specific model; it can still run other models with normal parsing behavior.
GraalVM 25+ is recommended for the best JIT performance; it provides partial but good support for the Vector API.
By default, the preferred vector size is used; it can be forced with `-Dllama.VectorBitSize=0|128|256|512`, where 0 means disabled.
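Vector API kernels typically follow a standard pattern: loop over the preferred species width, then handle the remainder with a scalar tail. A hedged sketch of a dot-product kernel in that style (illustrative, not the actual Gemma4.java matrix-vector kernel):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch of a dot-product kernel using the platform's preferred vector
// species (illustrative; not the actual Gemma4.java kernel).
// Compile/run with: --add-modules jdk.incubator.vector
public final class DotKernel {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```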
Apache 2.0
