VeloxQuant-MLX — Fast KV Cache Quantization for Apple Silicon

What is a KV cache?

Every time an LLM generates a token, it reads back the key and value tensors for every previous token in the context. These tensors are stored in RAM; together they form the KV cache. At long contexts, the cache dominates memory: a 7B model at 128k tokens needs ~25 GB of KV cache alone, even when the model weights themselves are compressed.
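The size of an fp16 KV cache follows directly from the model shape. The sketch below is a back-of-the-envelope estimate only: the function name `kv_cache_bytes` and the example dimensions (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not the configuration behind the ~25 GB figure above, so the result will differ from it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Estimate KV cache size: keys + values for every layer, KV head, and token."""
    # The leading factor of 2 covers the key tensor and the value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical grouped-query-attention model shape (assumed for illustration only).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=128 * 1024)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB
```

A model without grouped-query attention keeps one KV head per query head, so the same formula gives a proportionally larger cache.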

What does VeloxQuant-MLX do?

It quantizes those tensors on the fly, trading a few bits of precision for dramatically less RAM. At 16× compression, a cache that needed 25 GB needs under 2 GB (25 / 16 ≈ 1.6 GB). Same model. More context. It runs on your Mac with near-identical output quality and an unchanged mlx_lm API.

Why not just use what I have?

How this compares to what you're already running

llama.cpp already quantizes the KV cache. oMLX serves mlx_lm models but doesn't compress the cache by default. Plain mlx_lm doesn't compress it at all. Here's the honest difference.

	Plain mlx_lm	llama.cpp / Ollama / LM Studio	oMLX	VeloxQuant-MLX
What it is	Model loading + generation library	Standalone inference runtime (C/C++)	Serving layer on top of mlx_lm	KV cache compression on top of mlx_lm
KV cache precision	fp16, always	Fixed: `q8_0` or `q4_0`	fp16 by default; one experimental toggle	1–8 bit, chosen per method
Compression schemes	None	One uniform scheme, same for every layer	One experimental scheme, independently implemented	41 methods
Token eviction	No	No	No	Yes — 11 methods
Cross-layer compression	No	No	No	Yes — XQuant, MiniCache, xKV
Memory strategy	None	On-device only	SSD-tiered paging (hot RAM / cold SSD)	In-memory compression
Per-layer / per-head tuning	N/A	No — one setting, whole model	No	Yes, per layer
Model format	MLX safetensors	GGUF	MLX safetensors	MLX safetensors — same as mlx_lm
Hardware support	Apple Silicon only	Broadest — Apple Silicon, x86, Linux, Windows, mobile	Apple Silicon only	Apple Silicon only
Integration	Native	Native	brew install / CLI server	3 lines on top of mlx_lm

A note on oMLX + "TurboQuant": some articles describe oMLX as shipping TurboQuant KV cache support. That refers to an independent implementation of Google's TurboQuant paper (arXiv:2504.19874), the same family that this library's turboquant_rvq method implements; it is not a dependency on this package. The toggle has also been reported broken or removed across oMLX releases.

Where llama.cpp wins

It's the more mature, more portable project — runs on every platform, not just Apple Silicon. Its q4_0/q8_0 KV quantization is production-tested at massive scale through Ollama and LM Studio, and needs zero configuration. If you're not on a Mac, or you want the most battle-tested path with the least decision-making, that's the right default.

Where oMLX fits

A different axis entirely — it's a serving layer, not a compression scheme. Continuous batching, an OpenAI/Anthropic-compatible API, and SSD-tiered paging for long-idle sessions. Not mutually exclusive with VeloxQuant-MLX — both build on mlx_lm — but that combination isn't a supported path here today.

Where VeloxQuant-MLX wins

q4_0 is one fixed 4-bit scheme applied uniformly, with no eviction, ever. VeloxQuant-MLX goes lower — up to 16× key compression — with methods tuned for RoPE preservation, geometric key clusters, and per-layer budgets, plus 11 eviction methods that drop stale tokens entirely. All on top of the mlx_lm stack you're likely already running.

No head-to-head throughput numbers against llama.cpp's cache are published here — different runtime, different hardware paths. The honest comparison today is architectural, not a benchmark race. Full write-up →

Quickstart

Plug into `mlx_lm` in 3 lines

Same mlx_lm.generate API — just pass a compressed cache.

New here? See the Installation guide →

python: RVQ 1-bit · 7.5× compression · no calibration

import mlx_lm
from veloxquant_mlx import KVCacheBuilder, KVCacheConfig

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# 7.5× key cache compression — within 5% of fp16 throughput
config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=1, seed=42)
caches = KVCacheBuilder.for_model(model, config)

response = mlx_lm.generate(
    model, tokenizer,
    prompt="Write a 5,000-word analysis of the RLHF literature.",
    max_tokens=5000,
    prompt_cache=caches,
)

python: VecInfer 1-bit · 16× compression · Metal kernels auto-detected

import mlx_lm
from veloxquant_mlx import KVCacheConfig, KVCacheFactory
from veloxquant_mlx.allocators.vecinfer import calibrate_smooth_factors, train_codebook

model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-4bit")

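# One-time calibration: run once, cache the results.
# sample_keys and sample_keys_flat are key tensors sampled from the model
# beforehand; how they are collected is not shown on this page.
smooth = calibrate_smooth_factors(sample_keys)   # [n_heads, head_dim]
codebook = train_codebook(sample_keys_flat, n_centroids=256, sub_dim=8)

# 16× key compression; Metal kernel auto-detected for 13× faster quantization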
config = KVCacheConfig(
    method="vecinfer",
    head_dim=128,
    key_codebook_bits=8,      # 256 centroids
    key_sub_dim=8,             # 16× compression at 1 bit/elem
    smooth_factors=smooth,
    key_codebook=codebook,
    use_metal_kernels=None,      # None=auto, True=require, False=forbid
)
caches = KVCacheFactory.create_for_model(model, config)

response = mlx_lm.generate(
    model, tokenizer,
    prompt="Write a 5,000-word analysis of the RLHF literature.",
    max_tokens=5000,
    prompt_cache=caches,
)
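The "16× compression at 1 bit/elem" comment in the config above comes down to simple arithmetic: one 8-bit codebook index replaces a sub-vector of 8 fp16 values. The sketch below works through that calculation. The helper names are hypothetical and not part of veloxquant_mlx, and the estimate ignores the shared codebook and smoothing factors, which are stored once rather than per token.

```python
def vq_bits_per_element(codebook_bits, sub_dim):
    """Bits spent per original element when one index encodes `sub_dim` elements."""
    return codebook_bits / sub_dim


def compression_ratio(bits_per_elem, baseline_bits=16):
    """Compression relative to an fp16 (16-bit) baseline."""
    return baseline_bits / bits_per_elem


bpe = vq_bits_per_element(codebook_bits=8, sub_dim=8)   # 1.0 bit per element
ratio = compression_ratio(bpe)                          # 16.0

# Per-token, per-head key storage for head_dim=128:
head_dim = 128
fp16_bytes = head_dim * 2                # 256 bytes of fp16 values
index_bytes = (head_dim // 8) * 8 // 8   # 16 sub-vectors x 8-bit index = 16 bytes
print(bpe, ratio, fp16_bytes, index_bytes)
```

Halving `sub_dim` to 4 with the same 8-bit codebook would spend 2 bits per element and halve the ratio to 8×.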

python: RateQuant · per-layer mixed precision · 1.5-bit average

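import mlx_lm
from veloxquant_mlx import (
    KVCacheBuilder, KVCacheConfig,
    calibrate_layer_sensitivities,   # 1.6s one-time probe
    allocate_bits_ratequant,          # Theorem 2 reverse-waterfilling
)

# Load the model first, as in the examples above
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")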

# Step 1 — probe real activations
weights = calibrate_layer_sensitivities(model, tokenizer)

# Step 2 — closed-form allocation; average is exact
alloc = allocate_bits_ratequant(weights, target_avg_bits=1.5, beta=3.5)
# alloc = [1, 2, 1, 1, 3, 1, 2, ...]  one int per layer

# Step 3 — build per-layer caches
config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=alloc)
caches = KVCacheBuilder.for_model(model, config)
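To give a feel for how per-layer sensitivities can turn into an integer bit allocation with an exact average, here is a minimal sketch of a generic greedy allocator. It is not the library's `allocate_bits_ratequant` and not the closed-form reverse-waterfilling rule mentioned above; it swaps in a simpler textbook technique. The function name, the distortion model `w * 4**(-bits)`, and the example weights are all assumptions made for illustration.

```python
import heapq


def allocate_bits_greedy(weights, target_avg_bits, min_bits=1, max_bits=8):
    """Greedy integer bit allocation under an assumed distortion model w * 4**(-b).

    Every layer starts at min_bits; each remaining bit goes to the layer whose
    modelled distortion is currently largest, so sensitive layers get more bits.
    """
    n = len(weights)
    budget = round(target_avg_bits * n)  # total bits to hand out across layers
    if not min_bits * n <= budget <= max_bits * n:
        raise ValueError("target average is outside [min_bits, max_bits]")

    bits = [min_bits] * n
    # Max-heap via negated distortion; the reduction from one extra bit is
    # proportional to the current distortion, so ranking by distortion suffices.
    heap = [(-w * 4.0 ** (-min_bits), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)

    for _ in range(budget - min_bits * n):
        _, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:  # layers at max_bits leave the pool
            heapq.heappush(heap, (-weights[i] * 4.0 ** (-bits[i]), i))
    return bits


# Hypothetical sensitivities: the second layer is far more sensitive than the rest.
alloc = allocate_bits_greedy([1.0, 8.0, 1.0, 1.0], target_avg_bits=1.5)
print(alloc, sum(alloc) / len(alloc))
```

Because the total budget is fixed before any bits are assigned, the average comes out exact whenever `target_avg_bits * n_layers` is a whole number.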

Benchmarks

12 models · Apple Silicon · end-to-end generation

All numbers come from end-to-end mlx_lm.generate runs with a 200-token prompt and a 120-token generation, on Apple M-series hardware with unified memory.

Model	FP16 baseline (tok/s)	RVQ-1bit (tok/s)	RVQ key cache compression	VecInfer-1bit (tok/s)	VecInfer key cache compression	Best pick
SmolLM2-135M	250.4	188.5	7.1×	175.8	16×	RVQ-1bit
Llama-3.2-1B	105.4	104.3	7.1×	91.2	16×	RVQ-1bit
Llama-3.2-3B	47.6	46.2	7.5×	40.2	16×	RVQ-1bit
Llama-3.1-8B	20.5	20.6	7.5×	19.6	16×	RVQ-1bit
Mistral-7B	23.6	22.8	7.5×	9.8	16×	RVQ-1bit
Qwen2.5-7B	21.0	20.7	7.5×	21.5 ↑ fp16	16×	VecInfer-1bit
Qwen3-8B	20.3	19.6	7.5×	2.4	16×	RVQ-1bit
Phi-4	10.4	8.1	7.5×	4.0	16×	TQ-2bit (9.6)
Falcon3-7B	17.3	21.7	7.8×	17.0	16×	RVQ-1bit
gemma-3-4b	26.0	24.2	7.8×	22.6	16×	VecInfer-1bit

Qwen2.5-7B with VecInfer-1bit exceeds fp16 throughput at 16× compression (strong GQA: 28 query heads, 4 KV heads). gemma-3-4b reaches 22.6 tok/s at 16×, about 87% of its fp16 baseline. Full per-model plots and raw JSON are in figures/vecinfer/ in the repo.
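Throughput relative to fp16 is easier to compare across models than raw tok/s. The short sketch below derives it for three rows copied from the table above; the variable names are illustrative and nothing here calls the library.

```python
# (fp16, RVQ-1bit, VecInfer-1bit) tok/s, copied from the benchmark table.
rows = {
    "Llama-3.1-8B": (20.5, 20.6, 19.6),
    "Qwen2.5-7B": (21.0, 20.7, 21.5),
    "gemma-3-4b": (26.0, 24.2, 22.6),
}

# Relative throughput: a value above 1.0 means faster than the fp16 baseline.
relative = {
    name: (rvq / fp16, vecinfer / fp16)
    for name, (fp16, rvq, vecinfer) in rows.items()
}

for name, (rvq_rel, vec_rel) in relative.items():
    print(f"{name}: RVQ-1bit {rvq_rel:.1%} of fp16, VecInfer-1bit {vec_rel:.1%} of fp16")
```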

See the full algorithm details and decision guide in the algorithm reference →.

Installation

Get started in seconds

Requires Apple Silicon (M1 or later) and Python 3.11+.

pip

pip install VeloxQuant-MLX

pip — dev extras

pip install "VeloxQuant-MLX[dev]"

from source

git clone https://github.com/rajveer43/VeloxQuant-MLX
cd VeloxQuant-MLX
pip install -e ".[dev]"

Apple Silicon M1 or later (M2/M3/M4 recommended)
Python 3.11 or 3.12
MLX ≥ 0.18
NumPy ≥ 1.26
Matplotlib ≥ 3.8 (for benchmark plots)
MIT License — free for commercial use
12 production models validated
1484/1484 tests passing
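Before installing, it can help to confirm that the machine meets the two hard requirements listed above. The following is a small standard-library sketch; `check_environment` is a hypothetical helper, not something shipped with VeloxQuant-MLX, and it does not check the MLX, NumPy, or Matplotlib versions.

```python
import platform
import sys


def check_environment(min_python=(3, 11)):
    """Return a list of problems; an empty list means the basic requirements are met."""
    problems = []
    # Native Python on Apple Silicon reports Darwin / arm64.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        problems.append("Apple Silicon (macOS on arm64) is required")
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    return problems


for problem in check_environment():
    print("Problem:", problem)
```

A Python interpreter running under Rosetta reports `x86_64` rather than `arm64`, so this check would flag it even on an M-series Mac.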

