
Squish — Local LLM Inference for Apple Silicon


Every model you download ships in a format designed for training clusters, not laptops. Squish converts it once into a Metal-native format that maps directly into unified memory — sub-second cold loads, every time.

⚠️ macOS + Apple Silicon (M1–M5) only. Linux/CUDA support is on the roadmap.


The Numbers

| | mlx_lm (cold) | Ollama | Squish |
|---|---|---|---|
| Cold-start load time | 28.81 s | 8–25 s | 0.33–0.53 s |
| RAM during load | ~2,400 MB | ~2,000–8,000 MB | 160 MB ‡ |
| Disk size — 8B model | 16.4 GB | ~4.7 GB (GGUF q4) | 4.4 GB (INT4 squished) |
| Throughput — qwen3:8b (M3) | 12–16 tok/s | 14–19 tok/s | 14–22 tok/s † |
| Throughput — qwen3:4b (M3) | 28–36 tok/s | 30–40 tok/s | 35–50 tok/s † |
| Throughput — qwen3:1.7b (M3) | 55–70 tok/s | 55–75 tok/s | 65–90 tok/s † |
| OpenAI-compatible API | ✅ | ✅ | ✅ |
| Ollama-compatible API | ❌ | ✅ | ✅ |
| Web chat UI | ❌ | ❌ | ✅ |
| Grammar-enforced tool calling | ❌ | ❌ | ✅ |
| Batch / concurrent requests | ❌ | limited | ✅ |
| macOS menu bar app | ❌ | ❌ | ✅ |
| VS Code extension | ❌ | ❌ | ✅ |
| Pre-squished weights (skip compression) | N/A | N/A | ✅ (HuggingFace) |
| Source available | ✅ | ✅ | ✅ |

54Γ— faster cold load. 15Γ— less RAM during load. 3.7Γ— smaller model files. Statistically identical outputs.

‑ 160 MB = Apple Metal virtual-address delta during load (mmap, no CPU heap). Peak RSS ~402 MB.
† Throughput measured with --agent preset (AgentKV + speculative decode). MLX-native, no GGUF conversion.

Model Sizes — Raw vs Squished

| Model | Raw (bf16) | Squished (INT4) | Saved |
|---|---|---|---|
| qwen3:0.6b | 1.3 GB | 0.4 GB | 69% |
| qwen3:1.7b | 3.5 GB | 1.0 GB | 71% |
| qwen3:4b | 8.2 GB | 2.2 GB | 73% |
| qwen3:8b | 16.4 GB | 4.4 GB | 73% |
| qwen3:14b | 28.7 GB | 7.6 GB | 74% |
| llama3.1:8b | 16.1 GB | 4.3 GB | 73% |
| deepseek-r1:7b | 14.4 GB | 3.9 GB | 73% |
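The ~70–74% figures follow from simple bits-per-weight arithmetic. A minimal sketch (illustrative accounting only, not Squish's actual size calculation) assuming a fraction of weights is stored at 4 bits and the rest stays at 16 bits:

```python
def estimated_savings(quantized_fraction: float,
                      quant_bits: int = 4,
                      raw_bits: int = 16) -> float:
    """Fraction of disk saved when `quantized_fraction` of the weights are
    stored at `quant_bits` each and the remainder stays at `raw_bits`."""
    stored = quantized_fraction * quant_bits + (1 - quantized_fraction) * raw_bits
    return 1 - stored / raw_bits

# An all-INT4 model would save exactly 75%; keeping a few percent of
# weights in fp16 (embeddings, norms) lands near the observed ~73%.
print(f"{estimated_savings(1.00):.1%}")
print(f"{estimated_savings(0.97):.1%}")
```

The shortfall from the 75% ceiling is the passthrough tensors, plus per-group scales and metadata.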

Install

# Homebrew (recommended)
brew install wesleyscholl/squish/squish
# One-liner installer
curl -fsSL https://raw.githubusercontent.com/squishai/squish/main/install.sh | bash
# pip
pip install squish

Quick Start

squish run                  # auto-detects RAM, pulls + starts best model for your machine

Or pick a specific model:

squish catalog              # browse 29 available models
squish pull qwen3:8b        # download pre-squished weights from HuggingFace (~4.4 GB)
squish run qwen3:8b         # start server on :11435

Then open http://localhost:11435/chat in any browser.

Or chat in the terminal:

squish chat qwen3:8b

Drop-in for any OpenAI or Ollama client:

export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
# or
export OLLAMA_HOST=http://localhost:11435

First time? Use the interactive setup wizard:

squish setup                # detects your RAM, recommends a model, pulls + starts it

Core Features

  • Sub-second loads — INT4 npy-dir format maps directly into Apple Metal unified memory; no dtype conversion on every boot
  • OpenAI + Ollama drop-in — any existing client works with a single env-var change; no code changes required
  • macOS menu bar app — SquishBar lives in your menu bar; shows live tok/s, start/stop server, one-click model switch
  • VS Code extension — sidebar chat with streaming, model selector, server lifecycle management (setup guide)
  • Web chat UI — built-in at /chat; dark-themed, streaming, offline, multi-session history
  • Grammar-enforced tool calling — XGrammar FSM prevents malformed JSON in tool use; works with any OpenAI tools client
  • Agent preset — --agent (auto-enabled on Apple Silicon) wires AgentKV INT2 + speculative decode + semantic cache
  • 29 ready-to-use models — pre-squished weights on HuggingFace skip the compression step; squish pull qwen3:8b finishes in minutes

See MODULES.md for the full flag reference and stability tiers (Stable / Beta / Experimental).


Links

| Resource | URL |
|---|---|
| Docs | squishai.github.io/squish |
| HuggingFace models | huggingface.co/squish-community |
| Module reference | MODULES.md |
| VS Code agent setup | docs/vscode-agent.md |
| Architecture paper | docs/paper.md |
| Contributing | CONTRIBUTING.md |
| Discord | discord.gg/squish |

Demo


The Numbers That Matter

Model: Qwen2.5-1.5B-Instruct · Hardware: Apple Silicon M-series, MLX framework

| | Cold mlx_lm load † | Reference (mlx_lm) | Squish (cached) |
|---|---|---|---|
| Load time | 28.81 s | 1.96 s | 0.53 s |
| RAM during load | ~2400 MB | ~2400 MB | 160 MB |
| Peak load RAM | ~2600 MB | ~2600 MB | 402 MB |
| Token cost | $0 (local) | $0 (local) | $0 |
| Original .safetensors needed? | ✅ mandatory | ✅ mandatory | ❌ not needed |

†Cold = OS page cache cold, first process start.
Squish cached = after one-time 19s conversion; all subsequent runs.

54Γ— faster cold load. 15Γ— less RAM. Statistically identical outputs.

Figure 1 — Cold-start load time comparison across three configurations

Figure 2 — Peak RAM during model load


The Problem

Every model you download ships in .safetensors — a format designed to move weights between training clusters. It was never designed as a local runtime format.

When mlx_lm.load() runs, it:

  1. Allocates ~2.4 GB into CPU heap even though Apple Silicon has unified memory
  2. Converts every tensor from storage dtype to runtime dtype — every single boot
  3. Makes you wait 28 seconds before the first token — for data that never changes

Squish fixes all three by decoupling storage from runtime. The original files are not needed after the first run. Delete them.
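The storage/runtime decoupling is the classic copy-vs-map distinction. A minimal sketch in plain numpy (a stand-in for illustration only — Squish's real path uses mx.load and Metal virtual-address mapping, not numpy): once weights are persisted in their runtime layout, mapping the file beats copying it.

```python
import os
import tempfile
import numpy as np

# One-time "conversion": persist the weights already in their runtime dtype.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.random.rand(1024, 1024).astype(np.float16))

# Heap load: every byte is copied into process memory up front.
heap = np.load(path)

# Mapped load: the array is backed by the file's pages; the OS faults data
# in on first touch, so "loading" is near-instant and allocates no heap.
mapped = np.load(path, mmap_mode="r")

assert np.array_equal(heap, mapped)  # same values, different memory residency
```

On unified-memory hardware the mapped pages are directly visible to the GPU, which is what makes the sub-second cold load possible.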


How It Works

FIRST RUN (~5–10 min — one-time per machine, done automatically by `squish pull`)
HuggingFace MLX weights ──► Squish INT4 compress ──► npy-dir on disk
                                      │
                                      └──► squish_weights.safetensors  (bf16, MLX-native)

ALL SUBSEQUENT RUNS (0.53 s cold / 0.33 s warm)
squish_weights.safetensors ──► mx.load() ──► Metal GPU map ──► model ready

No CPU heap allocation. No dtype conversion. Direct Metal virtual-address mapping.

Three-Tier Cache

| Tier | File | Load time |
|---|---|---|
| 0 | INT8 .npy tensors (Vectro compressed) | ~19 s |
| 1 | finalized/*.npy (float16, per-tensor) | ~4.5 s |
| 2 | squish_weights.safetensors (bf16 MLX) | 0.33–0.53 s |

Figure 4 — Squish three-tier weight cache architecture
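A tiered cache like this reduces to a fastest-first probe at load time. A hypothetical sketch (directory and file names below are illustrative — only squish_weights.safetensors is documented above, the rest are made up for the example):

```python
import os
import tempfile
from pathlib import Path

# Fastest-first probe order. Only the Tier-2 file name matches Squish's
# documented layout; the Tier-0/1 names here are hypothetical.
TIERS = [
    (2, "squish_weights.safetensors"),  # bf16 MLX-native, ~0.33-0.53 s
    (1, "finalized"),                   # per-tensor fp16 .npy, ~4.5 s
    (0, "int8_npy"),                    # compressed INT8 tensors, ~19 s
]

def pick_tier(model_dir: str) -> int:
    """Return the fastest cache tier whose artifact exists under model_dir."""
    root = Path(model_dir)
    for tier, name in TIERS:
        if (root / name).exists():
            return tier
    raise FileNotFoundError("no cached artifact; run conversion first")

# Example: a directory holding only the Tier-1 artifact resolves to tier 1.
d = tempfile.mkdtemp()
os.makedirs(os.path.join(d, "finalized"))
print(pick_tier(d))  # → 1
```

Each slower tier only exists to rebuild the tier above it, which is why every run after the first hits Tier 2.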


Benchmark Accuracy

Evaluated with EleutherAI lm-evaluation-harness — the framework behind the Open LLM Leaderboard.

| Task | Reference | Squish | Δ | Pass |
|---|---|---|---|---|
| ARC-Easy (acc_norm) | 74.5% | 73.5% | -1.0% | ✅ |
| HellaSwag (acc_norm) | 63.5% | 62.0% | -1.5% | ✅ |
| Winogrande (acc) | 65.5% | 67.0% | +1.5% | ✅ |
| PIQA (acc_norm) | 77.5% | 76.5% | -1.0% | ✅ |

Pass criterion: ≤2% delta (well within measurement noise at 200 samples).
Winogrande improved by 1.5% — INT8 quantisation noise is uncorrelated with task variance.
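The "within measurement noise" claim can be checked with the binomial standard error of an accuracy estimate — a quick back-of-the-envelope, not part of the harness:

```python
import math

def acc_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

# At 200 samples and ~73.5% accuracy, one standard error is ~3.1 points,
# so a 1-2 point delta between two runs is indistinguishable from noise.
print(round(acc_stderr(0.735, 200) * 100, 1))  # → 3.1
```

Tightening the pass criterion meaningfully below 2% would require several thousand samples per task.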

Full reproducibility commands and multi-seed results are in docs/RESULTS.md.

Figure 3 — Accuracy delta vs fp16 baseline across benchmarks and models


v1 β†’ v10: What Changed

Squish launched at v1.0 with a single optimization: the INT8 npy-dir format with three-tier caching. v10.0 adds 228 modules across seven phases of inference optimization. v10 itself focuses on inference velocity: faster TTFT and higher decode throughput on Apple Silicon, via server-wiring quick wins and six new speculative/attention algorithms. Accuracy is unchanged — every optimization preserves the ≤2% delta criterion.

| Metric | Squish v1 | Squish v9 | Squish v10 | Change (v9→v10) |
|---|---|---|---|---|
| Load time (1.5B, cached) | 0.53 s | 1.61 s | ~1.61 s | negligible |
| TTFT (1.5B) | ~668 ms † | 148 ms ✅ | ~100–130 ms ✅ | chunked prefill + spec prefill |
| TTFT (7B) | N/A | 533 ms | ~380–460 ms | CacheWarmup + chunked |
| Decode throughput (1.5B) | 18.9 tok/s | 7.5 tok/s § | ~10–15 tok/s | FusedSampler + LayerSkip |
| KV cache — prefix reuse | none | delta-only prefill | predictive warmup | CacheWarmupPredictor |
| Sampling overhead | ~0.35 ms | ~0.35 ms | ~0.08 ms | FusedSampler 4× speedup |
| Total modules | 8 | 222 | 228 | +6 Wave 28 modules |
| Total test count | — | ~4,876 | 7,672 | +2,796 tests |
| ARC-Easy accuracy | 73.5% | 73.5% ✅ | 73.5% ✅ | unchanged |
| HellaSwag accuracy | 62.0% | 63.0% ✅ | 63.0% ✅ | unchanged |
| PIQA accuracy | 76.5% | 76.5% ✅ | 76.5% ✅ | unchanged |
| WinoGrande accuracy | 67.0% | 66.0% | 66.0% | unchanged |

† v1 streaming had a trailing-chunk artifact β€” all tokens arrived after ~48 s wall-clock; TTFT via /health was already 668 ms. Β§ Measured on M3 under real system load (7 GB available RAM). Cold-dedicted-hardware throughput will be higher; spec-decode gains require a second draft model to be loaded.

Seven phases of optimization in v10:

| Phase | What it adds |
|---|---|
| 1 | Radix KV prefix sharing, PagedKV allocator, continuous batching, speculative decoding (ReDrafter) |
| 2 | Super-weight calibrator, asymmetric ternary quantization, Q-Filters, fast weight memory, LLM-42 determinism |
| 3 | Grammar engine (xgrammar), tool-calling acceleration, tag-dispatch, schema precompilation |
| 4 | MoE lookahead cache, Flash MLA, SSD acceptance predictor, Hydra speculative heads |
| 5 | Metal-fused kernels (RoPE, SwiGLU, INT8 KV attention), FFN mx.compile |
| 6 | Model pipeline, hash integrity checks, OpenAI compat suite, benchmark framework |
| 7 | FusedSampler on by default, CacheWarmup, chunked prefill universal, ToMe+LayerSkip flags, CascadeSpec, DraftMultiplexer, AsyncDecodeOverlap, PerLayerSparseAttn, SpeculativePrefiller |

Run dev/benchmarks/bench_v9_vs_v1.py to regenerate the comparison table from saved results. Run dev/benchmarks/bench_eoe.py --output dev/results/eoe_v9.json on Apple Silicon to measure live numbers. Run dev/benchmarks/bench_wave27_28.py to benchmark Wave 27+28 module performance.


Drop-In API Server

Replace every cloud API call today. Start the server once; use it forever.

# Recommended: use the CLI
squish run 7b           # port 11435 by default

# Advanced: direct invocation
python3 -m squish.server \
    --model-dir      ~/models/Qwen2.5-7B-Instruct-bf16 \
    --compressed-dir ~/models/Qwen2.5-7B-Instruct-bf16-compressed \
    --port 11435

Key server flags (squish run --help for the full list):

| Flag | Values | Default | Purpose |
|---|---|---|---|
| --kv-cache-mode | fp16 · int8 · snap | fp16 | KV cache compression; int8 saves RAM on long contexts via KIVI INT8 + FP16 recent window; snap adds SnapKV importance-based eviction |
| --kv-cache-window | integer | 64 | FP16 recent-token window size for int8/snap modes |
| --kv-cache-budget | integer | 4096 | Max K/V positions retained in snap mode |
| --log-level | warning · info · debug | warning | Uvicorn log verbosity |
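The idea behind the int8 mode can be sketched in a few lines of numpy — a simplified model of KIVI-style KV compression (per-position INT8 scales plus an FP16 recent window), not Squish's actual kernel:

```python
import numpy as np

def compress_kv(kv: np.ndarray, window: int = 64):
    """Quantise all but the last `window` positions to INT8 with a
    per-position scale; keep the recent window in fp16.
    kv shape: (seq_len, head_dim)."""
    old, recent = kv[:-window], kv[-window:]
    scale = np.abs(old).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero
    q = np.round(old / scale).astype(np.int8)
    return q, scale.astype(np.float16), recent.astype(np.float16)

def decompress_kv(q, scale, recent):
    old = q.astype(np.float32) * scale.astype(np.float32)
    return np.concatenate([old, recent.astype(np.float32)])

kv = np.random.randn(256, 128).astype(np.float32)
q, s, r = compress_kv(kv)
restored = decompress_kv(q, s, r)
assert np.abs(restored - kv).max() < 0.1  # small quantisation error
```

Old positions shrink from 2 bytes to 1 byte each (plus a 2-byte scale per position), roughly halving KV RAM on long contexts while the hot recent window stays lossless in fp16.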

Key compress flags (squish compress --help):

| Flag | Default | Purpose |
|---|---|---|
| --awq | off | Run AWQ activation calibration before INT8/INT4 compression |
| --awq-samples N | 20 | Calibration samples for AWQ (more → better accuracy, slower) |
| --int4 | default | INT4 nibble-packed output (default for squish pull). Use squish pull --int8 to opt out. |
| --int8 | off | squish pull only: opt out of INT4 and use INT8 group-64 compression instead. ⚠ Not available on squish compress (use --int4 there). |
| --zstd-level N | 0 | Optional zstd entropy pass after quantisation (level 3 recommended) |
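"Nibble-packed" means two signed 4-bit values share one byte. A minimal numpy sketch of the packing scheme (illustrative only — not Squish's on-disk format, which also carries group scales):

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit values (range -8..7) two per byte."""
    assert values.size % 2 == 0
    u = (values.astype(np.int8) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: split each byte and sign-extend the nibbles."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out > 7, out - 16, out).astype(np.int8)  # sign-extend

vals = np.array([-8, -1, 0, 7, 3, -4], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(vals)), vals)
```

The round-trip is exact; all the accuracy loss in INT4 compression comes from quantising fp16 weights down to the -8..7 grid, not from the packing itself.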

Point any OpenAI client at it — no code changes:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="squish",   # value ignored; no auth locally
)

# Streaming works
for chunk in client.chat.completions.create(
    model="Qwen2.5-1.5B-Instruct-bf16",
    messages=[{"role": "user", "content": "Explain attention mechanisms."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Works with: Python openai ≥1.0, LangChain, LlamaIndex, Continue.dev, Cursor, any client that speaks the OpenAI wire protocol.

Server Endpoints

| Endpoint | Status |
|---|---|
| POST /v1/chat/completions | ✅ streaming + non-streaming + tool calls |
| POST /v1/completions | ✅ legacy text completion |
| GET /v1/models | ✅ model listing |
| GET /health | ✅ liveness probe |
| GET /v1/metrics | ✅ throughput · queue depth · memory |
| POST /v1/embeddings | ✅ mean-pool, L2-normalised |
| GET /chat | ✅ web chat UI (browser) |
| POST /api/chat | ✅ Ollama-compatible ndjson |
| POST /api/generate | ✅ Ollama-compatible ndjson |
| GET /api/tags | ✅ Ollama model listing |
| GET /api/version | ✅ Ollama version handshake |
| POST /api/embeddings | ✅ Ollama-compatible embeddings |

Web Chat UI

Open http://localhost:11435/chat in any browser after starting the server.

  • Dark-themed, single-page app — no external services, works fully offline
  • Streaming responses with live token rendering (marked.js + highlight.js)
  • Conversation history persisted in localStorage (multi-session sidebar)
  • Model selector auto-populated from /v1/models
  • System prompt editor, settings panel (temp / top_p / max_tokens / seed)
  • Copy buttons on all code blocks

Ollama Drop-In

Squish mounts the full Ollama HTTP API at /api/*. Any tool that speaks Ollama will work against Squish with a single env-var change and zero code changes.

# Point any Ollama client at Squish
export OLLAMA_HOST=http://localhost:11435

# Works with the official Ollama CLI
ollama list
ollama run squish   # uses /api/generate under the hood

# Works with Continue.dev, Open WebUI, Enchanted, Msty, etc.
# Works with the official ollama Python library
import ollama

client = ollama.Client(host="http://localhost:11435")
response = client.chat(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What is entropy coding?"}],
)
print(response["message"]["content"])

Tool / Function Calling

/v1/chat/completions accepts OpenAI-format tools and returns tool_calls in the response. Squish injects the JSON schema into the system prompt (Qwen2.5 style) and parses the structured output automatically.

import openai, json

client = openai.OpenAI(base_url="http://localhost:11435/v1", api_key="squish")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.choices[0].finish_reason == "tool_calls":
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(f"Tool: {call.function.name}, Args: {args}")
    # β†’ Tool: get_weather, Args: {'city': 'Tokyo', 'unit': 'celsius'}

Integrations

Ready-made config templates live in configs/. Start Squish once, then point any of these tools at it — no cloud API key needed for any of them.

Continue.dev (VS Code / JetBrains AI assistant)

# Copy config to Continue.dev's config directory
cp configs/continue.json ~/.continue/config.json
squish run 7b
# Re-open VS Code β†’ Continue sidebar β†’ Squish model appears automatically

aider (AI pair programming in the terminal)

pip install aider-chat
squish run 7b

# Use the bundled config
aider --config configs/aider.yml

# Or install globally
cp configs/aider.yml ~/.aider.conf.yml
aider   # picks up config automatically

LiteLLM (unified proxy — route multiple providers through one endpoint)

pip install litellm
squish run 7b

litellm --config configs/litellm.yaml --port 4000
# β†’ all OpenAI clients pointing at localhost:4000 now use Squish

Open WebUI / Enchanted / Msty (Ollama-compatible frontends)

Set the Ollama host to http://localhost:11435 — all Ollama-compatible UIs work out of the box with zero additional configuration.


Advanced Features

Beyond the core stable feature set, Squish includes a large library of inference optimisations.

Stable (validated on hardware): INT8/INT4 compression, KV cache compression (KIVI + SnapKV), speculative decoding, AWQ calibration, prefix/radix cache, batch scheduler, streaming, paged attention, Flash Attention, Ollama drop-in, tool calling.

Beta: Advanced KV compression (ShadowKV, PQCache, YOCO, DiffKV), additional speculative decode variants (EAGLE3, MEDUSA, KnapSpec), attention architectures (SageAttention2, GQA, ChunkedPrefill).

Experimental: Cutting-edge attention (FlashMLA, NativeSparseAttn), extended quantisation (VPTQ, FP8, MXQuant, TernaryQuant), long-context optimisations (DualChunkAttn, MInference).

See MODULES.md for the full flag reference with one-line descriptions of every supported optimisation, categorised by stability tier.


Community

  • Discord — get help, share benchmarks, discuss models
  • GitHub Discussions — Q&A, ideas, show & tell
  • HuggingFace — pre-squished model weights (no local compression needed)
  • Contributing — good first issues, dev setup, PR guidelines

Requirements

  • macOS · Apple Silicon (M1–M5)
  • Python 3.10+ (3.12 recommended)
  • Dependencies install automatically via pip install squish
  • Core: mlx-lm, numpy, transformers, fastapi, uvicorn[standard], safetensors, zstandard, aiofiles, huggingface-hub
  • Eval extras: pip install squish[eval] adds lm-eval, datasets, accelerate
  • Optional: Rust quantizer (squish_quant_rs/) for 4–6× faster compression throughput

Weight Fidelity

| Metric | Value |
|---|---|
| Mean cosine similarity | 0.99999 |
| Min cosine similarity | 0.99995 |
| First-token agreement | 5/5 test prompts |
| Tensors quantised (INT8) | 249 / 338 |
| Tensors passthrough (fp16) | 89 / 338 |

Embeddings, layer norms, and lm_head are stored as passthrough float16.
Zero quantisation error on the prediction path.
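The cosine-similarity metric above is straightforward to reproduce. A minimal sketch (a toy round-trip through symmetric INT8, not Squish's calibrated quantiser):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Round-trip a weight tensor through symmetric per-tensor INT8
# and compare against the original.
w = np.random.randn(4096).astype(np.float32)
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)
w_restored = w_q.astype(np.float32) * scale

print(f"{cosine_sim(w, w_restored):.5f}")  # typically 0.9999+
```

Even naive symmetric quantisation lands near the reported values; per-group scales and AWQ calibration push the minimum higher.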


Novelty

The prior work: BitStack (ICLR 2025), Huff-LLM (Feb 2025), DFloat11, NeuZip.
None of them work on Apple Silicon. None serve an OpenAI-compatible API.
None achieve sub-second loads from a compressed format.

MLX GitHub issue #3043 (January 2026) — an open feature request to add entropy coding to MLX — is the clearest signal this gap exists and is unsolved.

Search "compressed weight" "MLX" inference "no decompression" "Apple Silicon" — zero results.


The Summary Worth Citing

Squish INT8 compression achieves accuracy statistically equivalent to fp16 baseline
across four standard reasoning benchmarks (ARC-Easy, HellaSwag, Winogrande, PIQA),
while reducing cold-start load time by 54× and peak load RAM by 6×.
The compressed format requires zero access to the original model files
after a one-time per-device conversion.

The numbers are real. Run it yourself.