Every model you download ships in a format designed for training clusters, not laptops. Squish converts it once into a Metal-native format that maps directly into unified memory — sub-second cold loads, every time.
⚠️ macOS + Apple Silicon (M1–M5) only. Linux/CUDA support is on the roadmap.
| | mlx_lm (cold) | Ollama | Squish |
|---|---|---|---|
| Cold-start load time | 28.81 s | 8–25 s | 0.33–0.53 s |
| RAM during load | ~2,400 MB | ~2,000–8,000 MB | 160 MB ⚡ |
| Disk size — 8B model | 16.4 GB | ~4.7 GB (GGUF q4) | 4.4 GB (INT4 squished) |
| Throughput — qwen3:8b M3 | 12–16 tok/s | 14–19 tok/s | 14–22 tok/s † |
| Throughput — qwen3:4b M3 | 28–36 tok/s | 30–40 tok/s | 35–50 tok/s † |
| Throughput — qwen3:1.7b M3 | 55–70 tok/s | 55–75 tok/s | 65–90 tok/s † |
| OpenAI-compatible API | ❌ | ✅ | ✅ |
| Ollama-compatible API | ❌ | ✅ | ✅ |
| Web chat UI | ❌ | ❌ | ✅ |
| Grammar-enforced tool calling | ❌ | ❌ | ✅ |
| Batch / concurrent requests | ❌ | limited | ✅ |
| macOS menu bar app | ❌ | ❌ | ✅ |
| VS Code extension | ❌ | ❌ | ✅ |
| Pre-squished weights (skip compression) | N/A | N/A | ✅ (HuggingFace) |
| Source available | ✅ | ✅ | ✅ |
54× faster cold load. 15× less RAM during load. 3.7× smaller model files. Statistically identical outputs.
⚡ 160 MB = Apple Metal virtual-address delta during load (mmap, no CPU heap). Peak RSS ~402 MB.
† Throughput measured with the `--agent` preset (AgentKV + speculative decode). MLX-native, no GGUF conversion.
| Model | Raw (bf16) | Squished (INT4) | Saved |
|---|---|---|---|
| qwen3:0.6b | 1.3 GB | 0.4 GB | 69% |
| qwen3:1.7b | 3.5 GB | 1.0 GB | 71% |
| qwen3:4b | 8.2 GB | 2.2 GB | 73% |
| qwen3:8b | 16.4 GB | 4.4 GB | 73% |
| qwen3:14b | 28.7 GB | 7.6 GB | 74% |
| llama3.1:8b | 16.1 GB | 4.3 GB | 73% |
| deepseek-r1:7b | 14.4 GB | 3.9 GB | 73% |
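As a back-of-envelope check on the table above, a sketch of where the numbers come from, assuming 4 bits per quantised weight plus a small fp16 passthrough fraction (the 5% figure and the helper name are illustrative, not Squish internals):

```python
def estimate_int4_size_gb(n_params_b: float, passthrough_frac: float = 0.05) -> float:
    """Rough on-disk size for an INT4-squished model.

    Assumes ~0.5 bytes/param (4 bits) for quantised tensors plus 2 bytes/param
    for fp16 passthrough tensors (embeddings, norms, lm_head). Illustrative only;
    real sizes vary with group metadata and per-model passthrough ratios.
    """
    n = n_params_b * 1e9
    quantised = n * (1 - passthrough_frac) * 0.5    # 4 bits per weight
    passthrough = n * passthrough_frac * 2.0         # kept in float16
    return (quantised + passthrough) / 1e9           # decimal GB

print(round(estimate_int4_size_gb(8.0), 2))   # 4.6 — in the ballpark of the 4.4 GB table entry
print(round(estimate_int4_size_gb(1.7), 2))   # ~1.0, matching the qwen3:1.7b row
```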
```bash
# Homebrew (recommended)
brew install wesleyscholl/squish/squish

# One-liner installer
curl -fsSL https://raw.githubusercontent.com/squishai/squish/main/install.sh | bash

# pip
pip install squish
```

```bash
squish run   # auto-detects RAM, pulls + starts the best model for your machine
```

Or pick a specific model:

```bash
squish catalog          # browse 29 available models
squish pull qwen3:8b    # download pre-squished weights from HuggingFace (~4.4 GB)
squish run qwen3:8b     # start server on :11435
```

Then open http://localhost:11435/chat in any browser.

Or chat in the terminal:

```bash
squish chat qwen3:8b
```

Drop-in for any OpenAI or Ollama client:

```bash
export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
# or
export OLLAMA_HOST=http://localhost:11435
```

First time? Use the interactive setup wizard:

```bash
squish setup   # detects your RAM, recommends a model, pulls + starts it
```

- Sub-second loads — INT4 npy-dir format maps directly into Apple Metal unified memory; no dtype conversion on every boot
- OpenAI + Ollama drop-in — any existing client works with a single env-var change; no code changes required
- macOS menu bar app — SquishBar lives in your menu bar; shows live tok/s, start/stop server, one-click model switch
- VS Code extension — sidebar chat with streaming, model selector, server lifecycle management (setup guide)
- Web chat UI — built-in at `/chat`; dark-themed, streaming, offline, multi-session history
- Grammar-enforced tool calling — XGrammar FSM prevents malformed JSON in tool use; works with any OpenAI `tools` client
- Agent preset — `--agent` (auto-enabled on Apple Silicon) wires AgentKV INT2 + speculative decode + semantic cache
- 29 ready-to-use models — pre-squished weights on HuggingFace skip the compression step; `squish pull qwen3:8b` finishes in minutes
See MODULES.md for the full flag reference and stability tiers (Stable / Beta / Experimental).
| Resource | URL |
|---|---|
| Docs | squishai.github.io/squish |
| HuggingFace models | huggingface.co/squish-community |
| Module reference | MODULES.md |
| VS Code agent setup | docs/vscode-agent.md |
| Architecture paper | docs/paper.md |
| Contributing | CONTRIBUTING.md |
| Discord | discord.gg/squish |
Model: Qwen2.5-1.5B-Instruct · Hardware: Apple Silicon M-series, MLX framework

| | Cold mlx_lm load† | Reference (mlx_lm) | Squish (cached) |
|---|---|---|---|
| Load time | 28.81s | 1.96s | 0.53s |
| RAM during load | ~2400 MB | ~2400 MB | 160 MB |
| Peak load RAM | ~2600 MB | ~2600 MB | 402 MB |
| Token cost | $0 (local) | $0 (local) | $0 |
| Original `.safetensors` needed? | ✅ mandatory | ✅ mandatory | ❌ not needed |
† Cold = OS page cache cold, first process start.
Squish cached = after the one-time 19 s conversion; all subsequent runs.
54× faster cold load. 15× less RAM. Statistically identical outputs.
Figure 1 — Cold-start load time comparison across three configurations
Figure 2 — Peak RAM during model load
Every model you download ships in `.safetensors` — a format designed to move
weights between training clusters. It was never designed as a local runtime format.
When `mlx_lm.load()` runs, it:
- Allocates ~2.4 GB on the CPU heap even though Apple Silicon has unified memory
- Converts every tensor from storage dtype to runtime dtype — on every single boot
- Makes you wait 28 seconds before the first token — for data that never changes
Squish fixes all three by decoupling storage from runtime. The original files are not needed after the first run. Delete them.
```text
FIRST RUN (~5–10 min — one-time per machine, done automatically by `squish pull`)

HuggingFace MLX weights ──► Squish INT4 compress ──► npy-dir on disk
                                      │
                                      └──► squish_weights.safetensors (bf16, MLX-native)

ALL SUBSEQUENT RUNS (0.53 s cold / 0.33 s warm)

squish_weights.safetensors ──► mx.load() ──► Metal GPU map ──► model ready
```

No CPU heap allocation. No dtype conversion. Direct Metal virtual-address mapping.
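The load-path difference can be illustrated with plain numpy: a memory-mapped `.npy` file exposes tensor pages without copying them onto the process heap. This is a stand-in for the Metal virtual-address mapping, not Squish code:

```python
import numpy as np
import os
import tempfile

# Write a tensor once (the "squish" step of this toy example), then map it
# back without copying it onto the heap -- pages fault in only when touched.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "layer0.npy")
np.save(path, np.ones((1024, 1024), dtype=np.float16))

w = np.load(path, mmap_mode="r")   # mapped, not eagerly read
print(w.dtype, w.shape)            # float16 (1024, 1024)
print(type(w).__name__)            # memmap
```

Because the on-disk layout already matches the runtime layout, there is no dtype conversion step, which is the same property the safetensors tier above exploits.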
| Tier | File | Load time |
|---|---|---|
| 0 | INT8 `.npy` tensors (Vectro compressed) | ~19 s |
| 1 | `finalized/*.npy` (float16, per-tensor) | ~4.5 s |
| 2 | `squish_weights.safetensors` (bf16 MLX) | 0.33–0.53 s |

Figure 4 — Squish three-tier weight cache architecture
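The fallback logic the tiers imply can be sketched in a few lines; the file names follow the table, but the function itself is hypothetical, not the Squish API:

```python
import os
import tempfile

# Probe for the fastest tier artifact first, fall back to slower tiers.
TIERS = [
    (2, "squish_weights.safetensors"),  # bf16 MLX  -> 0.33-0.53 s
    (1, "finalized"),                   # fp16 .npy -> ~4.5 s
    (0, "int8_npy"),                    # INT8      -> ~19 s (illustrative dir name)
]

def pick_tier(model_dir: str):
    """Return the fastest available tier for a model dir, or None."""
    for tier, name in TIERS:
        if os.path.exists(os.path.join(model_dir, name)):
            return tier
    return None

d = tempfile.mkdtemp()
open(os.path.join(d, "squish_weights.safetensors"), "w").close()
print(pick_tier(d))  # 2
```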
Evaluated with the EleutherAI lm-evaluation-harness — the framework behind the Open LLM Leaderboard.

| Task | Reference | Squish | Δ | Pass |
|---|---|---|---|---|
| ARC-Easy (acc_norm) | 74.5% | 73.5% | -1.0% | ✅ |
| HellaSwag (acc_norm) | 63.5% | 62.0% | -1.5% | ✅ |
| Winogrande (acc) | 65.5% | 67.0% | +1.5% | ✅ |
| PIQA (acc_norm) | 77.5% | 76.5% | -1.0% | ✅ |
Pass criterion: ≤2% delta (well within measurement noise at 200 samples).
Winogrande improved by 1.5% — INT8 quantisation noise is uncorrelated with task variance.
Full reproducibility commands and multi-seed results are in docs/RESULTS.md.
Figure 3 — Accuracy delta vs fp16 baseline across benchmarks and models
Squish launched at v1.0 with a single optimization: the INT8 npy-dir format with three-tier caching. v10.0 adds 228 modules across seven phases of inference optimization, focusing on inference velocity: faster TTFT and higher decode throughput on Apple Silicon via server-wiring quick wins and six new speculative/attention algorithms. Accuracy is unchanged — every optimization preserves the ≤2% delta criterion.
| Metric | Squish v1 | Squish v9 | Squish v10 | Change (v9→v10) |
|---|---|---|---|---|
| Load time (1.5B, cached) | 0.53 s | 1.61 s | ~1.61 s | negligible |
| TTFT (1.5B) | ~668 ms† | 148 ms ✅ | ~100–130 ms ✅ | chunked prefill + spec prefill |
| TTFT (7B) | N/A | 533 ms | ~380–460 ms | CacheWarmup + chunked |
| Decode throughput (1.5B) | 18.9 tok/s | 7.5 tok/s§ | ~10–15 tok/s | FusedSampler + LayerSkip |
| KV cache — prefix reuse | none | delta-only prefill | predictive warmup | CacheWarmupPredictor |
| Sampling overhead | ~0.35 ms | ~0.35 ms | ~0.08 ms | FusedSampler 4× speedup |
| Total modules | 8 | 222 | 228 | +6 Wave 28 modules |
| Total test count | — | ~4,876 | 7,672 | +2,796 tests |
| ARC-Easy accuracy | 73.5% | 73.5% ✅ | 73.5% ✅ | unchanged |
| HellaSwag accuracy | 62.0% | 63.0% ✅ | 63.0% ✅ | unchanged |
| PIQA accuracy | 76.5% | 76.5% ✅ | 76.5% ✅ | unchanged |
| WinoGrande accuracy | 67.0% | 66.0% | 66.0% | unchanged |
† v1 streaming had a trailing-chunk artifact — all tokens arrived after ~48 s wall-clock; TTFT via /health was already 668 ms.
§ Measured on M3 under real system load (7 GB available RAM). Throughput on cold, dedicated hardware will be higher; spec-decode gains require a second draft model to be loaded.
Seven phases of optimization in v10:
| Phase | What it adds |
|---|---|
| 1 | Radix KV prefix sharing, PagedKV allocator, continuous batching, speculative decoding (ReDrafter) |
| 2 | Super-weight calibrator, asymmetric ternary quantization, Q-Filters, fast weight memory, LLM-42 determinism |
| 3 | Grammar engine (xgrammar), tool-calling acceleration, tag-dispatch, schema precompilation |
| 4 | MoE lookahead cache, Flash MLA, SSD acceptance predictor, Hydra speculative heads |
| 5 | Metal-fused kernels (RoPE, SwiGLU, INT8 KV attention), FFN mx.compile |
| 6 | Model pipeline, hash integrity checks, OpenAI compat suite, benchmark framework |
| 7 | FusedSampler on by default, CacheWarmup, chunked prefill universal, ToMe+LayerSkip flags, CascadeSpec, DraftMultiplexer, AsyncDecodeOverlap, PerLayerSparseAttn, SpeculativePrefiller |
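The Phase 1 prefix sharing boils down to one idea: reuse cached KV state for the longest previously seen prompt prefix, so only the new suffix needs prefill. A toy dict-based sketch of that idea (the real implementation uses a radix tree, and the KV handle here is just a placeholder string):

```python
class PrefixCache:
    """Toy longest-prefix KV reuse. Illustrative, not the Squish radix tree."""

    def __init__(self):
        self.store = {}  # tuple(prompt tokens) -> opaque KV handle

    def put(self, tokens, kv):
        self.store[tuple(tokens)] = kv

    def longest_prefix(self, tokens):
        # Scan from the full prompt down to length 1; return first hit.
        for n in range(len(tokens), 0, -1):
            kv = self.store.get(tuple(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, None

cache = PrefixCache()
cache.put([1, 2, 3], kv="kv@3")
hit, kv = cache.longest_prefix([1, 2, 3, 4, 5])
print(hit, kv)  # 3 kv@3 -> only tokens 4 and 5 need prefill
```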
Run `dev/benchmarks/bench_v9_vs_v1.py` to regenerate the comparison table from saved results.
Run `dev/benchmarks/bench_eoe.py --output dev/results/eoe_v9.json` on Apple Silicon to measure live numbers.
Run `dev/benchmarks/bench_wave27_28.py` to benchmark Wave 27+28 module performance.
Replace every cloud API call today. Start the server once; use it forever.
```bash
# Recommended: use the CLI
squish run 7b   # port 11435 by default

# Advanced: direct invocation
python3 -m squish.server \
  --model-dir ~/models/Qwen2.5-7B-Instruct-bf16 \
  --compressed-dir ~/models/Qwen2.5-7B-Instruct-bf16-compressed \
  --port 11435
```

Key server flags (`squish run --help` for the full list):
| Flag | Values | Default | Purpose |
|---|---|---|---|
| `--kv-cache-mode` | `fp16` · `int8` · `snap` | `fp16` | KV cache compression; `int8` saves RAM on long contexts via KIVI INT8 + FP16 recent window; `snap` adds SnapKV importance-based eviction |
| `--kv-cache-window` | integer | `64` | FP16 recent-token window size for `int8`/`snap` modes |
| `--kv-cache-budget` | integer | `4096` | Max K/V positions retained in `snap` mode |
| `--log-level` | `warning` · `info` · `debug` | `warning` | Uvicorn log verbosity |
Key compress flags (`squish compress --help`):

| Flag | Default | Purpose |
|---|---|---|
| `--awq` | off | Run AWQ activation calibration before INT8/INT4 compression |
| `--awq-samples N` | `20` | Calibration samples for AWQ (more → better accuracy, slower) |
| `--int4` | default | INT4 nibble-packed output (default for `squish pull`). Use `squish pull --int8` to opt out. |
| `--int8` | off (use on `squish pull`) | Opt out of INT4 and use INT8 group-64 compression instead. ⚠️ Not available on `squish compress` (use the `--int4` flag there). |
| `--zstd-level N` | `0` | Optional zstd entropy pass after quantisation (level 3 recommended) |
Point any OpenAI client at it — no code changes:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="squish",  # value ignored; no auth locally
)

# Streaming works
for chunk in client.chat.completions.create(
    model="Qwen2.5-1.5B-Instruct-bf16",
    messages=[{"role": "user", "content": "Explain attention mechanisms."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Works with: Python `openai` ≥1.0, LangChain, LlamaIndex, Continue.dev, Cursor,
any client that speaks the OpenAI wire protocol.
| Endpoint | Status |
|---|---|
| `POST /v1/chat/completions` | ✅ streaming + non-streaming + tool calls |
| `POST /v1/completions` | ✅ legacy text completion |
| `GET /v1/models` | ✅ model listing |
| `GET /health` | ✅ liveness probe |
| `GET /v1/metrics` | ✅ throughput · queue depth · memory |
| `POST /v1/embeddings` | ✅ mean-pool, L2-normalised |
| `GET /chat` | ✅ web chat UI (browser) |
| `POST /api/chat` | ✅ Ollama-compatible ndjson |
| `POST /api/generate` | ✅ Ollama-compatible ndjson |
| `GET /api/tags` | ✅ Ollama model listing |
| `GET /api/version` | ✅ Ollama version handshake |
| `POST /api/embeddings` | ✅ Ollama-compatible embeddings |
Open http://localhost:11435/chat in any browser after starting the server.
- Dark-themed, single-page app — no external services, works fully offline
- Streaming responses with live token rendering (marked.js + highlight.js)
- Conversation history persisted in `localStorage` (multi-session sidebar)
- Model selector auto-populated from `/v1/models`
- System prompt editor, settings panel (temp / top_p / max_tokens / seed)
- Copy buttons on all code blocks
Squish mounts the full Ollama HTTP API at `/api/*`. Any tool that speaks Ollama
will work against Squish with a single env-var change and zero code changes.
```bash
# Point any Ollama client at Squish
export OLLAMA_HOST=http://localhost:11435

# Works with the official Ollama CLI
ollama list
ollama run squish   # uses /api/generate under the hood

# Works with Continue.dev, Open WebUI, Enchanted, Msty, etc.
```

```python
# Works with the official ollama Python library
import ollama

client = ollama.Client(host="http://localhost:11435")
response = client.chat(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What is entropy coding?"}],
)
print(response["message"]["content"])
```

`/v1/chat/completions` accepts OpenAI-format `tools` and returns `tool_calls`
in the response. Squish injects the JSON schema into the system prompt (Qwen2.5
style) and parses the structured output automatically.
```python
import openai, json

client = openai.OpenAI(base_url="http://localhost:11435/v1", api_key="squish")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.choices[0].finish_reason == "tool_calls":
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(f"Tool: {call.function.name}, Args: {args}")
    # → Tool: get_weather, Args: {'city': 'Tokyo', 'unit': 'celsius'}
```

Ready-made config templates live in `configs/`. Start Squish once, then point
any of these tools at it — no cloud API key needed for any of them.
```bash
# Copy config to Continue.dev's config directory
cp configs/continue.json ~/.continue/config.json
squish run 7b
# Re-open VS Code → Continue sidebar → Squish model appears automatically
```

```bash
pip install aider-chat
squish run 7b

# Use the bundled config
aider --config configs/aider.yml

# Or install globally
cp configs/aider.yml ~/.aider.conf.yml
aider   # picks up config automatically
```

```bash
pip install litellm
squish run 7b
litellm --config configs/litellm.yaml --port 4000
# → all OpenAI clients pointing at localhost:4000 now use Squish
```

Set the Ollama host to http://localhost:11435 — all Ollama-compatible UIs work
out of the box with zero additional configuration.
Beyond the core stable feature set, Squish includes a large library of inference optimisations.
Stable (validated on hardware): INT8/INT4 compression, KV cache compression (KIVI + SnapKV), speculative decoding, AWQ calibration, prefix/radix cache, batch scheduler, streaming, paged attention, Flash Attention, Ollama drop-in, tool calling.
Beta: Advanced KV compression (ShadowKV, PQCache, YOCO, DiffKV), additional speculative decode variants (EAGLE3, MEDUSA, KnapSpec), attention architectures (SageAttention2, GQA, ChunkedPrefill).
Experimental: Cutting-edge attention (FlashMLA, NativeSparseAttn), extended quantisation (VPTQ, FP8, MXQuant, TernaryQuant), long-context optimisations (DualChunkAttn, MInference).
See MODULES.md for the full flag reference with one-line descriptions of every supported optimisation, categorised by stability tier.
- Discord — get help, share benchmarks, discuss models
- GitHub Discussions — Q&A, ideas, show & tell
- HuggingFace — pre-squished model weights (no local compression needed)
- Contributing — good first issues, dev setup, PR guidelines
- macOS · Apple Silicon (M1–M5)
- Python 3.10+ (3.12 recommended)
- Dependencies install automatically via `pip install squish`
- Core: `mlx-lm`, `numpy`, `transformers`, `fastapi`, `uvicorn[standard]`, `safetensors`, `zstandard`, `aiofiles`, `huggingface-hub`
- Eval extras: `pip install squish[eval]` adds `lm-eval`, `datasets`, `accelerate`
- Optional: Rust quantizer (`squish_quant_rs/`) for 4–6× faster compression throughput
| Metric | Value |
|---|---|
| Mean cosine similarity | 0.99999 |
| Min cosine similarity | 0.99995 |
| First-token agreement | 5/5 test prompts |
| Tensors quantised (INT8) | 249 / 338 |
| Tensors passthrough (fp16) | 89 / 338 |
Embeddings, layer norms, and lm_head are stored as passthrough float16.
Zero quantisation error on the prediction path.
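The cosine-similarity metric in the table can be reproduced in miniature: quantise a vector to INT8, dequantise, and compare against the float reference. Synthetic data, so the exact value differs from the table's model measurements:

```python
import numpy as np

np.random.seed(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in "logits": INT8 round-trip with a single absmax scale.
ref = np.random.randn(4096).astype(np.float32)
scale = float(np.abs(ref).max()) / 127.0
deq = np.clip(np.round(ref / scale), -127, 127) * scale

print(cosine(ref, deq) > 0.999)  # True: INT8 rounding barely rotates the vector
```

Per-element rounding error is bounded by half a quantisation step, so the angle between reference and dequantised vectors stays tiny, which is why the table's mean similarity sits at 0.99999.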
The prior work: BitStack (ICLR 2025), Huff-LLM (Feb 2025), DFloat11, NeuZip.
None of them work on Apple Silicon. None serve an OpenAI-compatible API.
None achieve sub-second loads from a compressed format.
MLX GitHub issue #3043 (January 2026) — an open feature request to add entropy coding to MLX — is the clearest signal this gap exists and is unsolved.
Search "compressed weight" "MLX" inference "no decompression" "Apple Silicon" — zero results.
Squish INT8 compression achieves accuracy statistically equivalent to fp16 baseline
across four standard reasoning benchmarks (ARC-Easy, HellaSwag, Winogrande, PIQA),
while reducing cold-start load time by 54× and peak load RAM by 6×.
The compressed format requires zero access to the original model files
after a one-time per-device conversion.
The numbers are real. Run it yourself.

