TurboQuant · RVQ · VecInfer · RateQuant · PolarQuant · QJL · SpectralQuant · CommVQ · RaBitQ — in MLX
Up to 16× key cache compression with near-fp16 throughput on M-series chips.
Now with custom Metal kernels — 13× faster quantize, 98% less peak memory, and RaBitQ 1-bit quantization fitting 6× more context in the same RAM. Plug into mlx_lm with three lines.
Why VeloxQuant-MLX
Measured on real Apple Silicon hardware across 10 production models. All numbers from end-to-end generation — not synthetic benchmarks.
Algorithms
Each method targets a different point on the compression–quality–throughput tradeoff curve. New in 0.7.0: RaBitQ 1-bit keys + MSE-b4 values achieves 6× full KV compression on Falcon3-7B — fitting 6× more context in the same memory budget. CommVQ adds RoPE-commutative VQ for exact attention compatibility. Mix and match per layer with RateQuant.
q @ K.T is preserved exactly. At 1 bit/elem, a 128-dim key becomes 16 bytes instead of 256 — 16× compression.
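The 16× figure follows from simple storage arithmetic. A minimal sketch, assuming an fp16 baseline and ignoring any per-vector metadata (scales, norms) that a real quantizer may also store; `key_bytes` is a hypothetical helper for illustration, not part of VeloxQuant-MLX:

```python
def key_bytes(head_dim: int, bits_per_elem: float) -> float:
    """Bytes needed to store one key vector at the given bit width."""
    return head_dim * bits_per_elem / 8


fp16 = key_bytes(128, 16)    # 256.0 bytes
one_bit = key_bytes(128, 1)  # 16.0 bytes
ratio = fp16 / one_bit       # 16.0
```

The same helper gives the ratio for other bit widths, for example `key_bytes(128, 16) / key_bytes(128, 2)` for a 2-bit cache.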
quantize_vq keeps argmin in thread-local registers — 13× faster, 98% less peak memory, no API change required.
calibrate_spectral_rotation(model, tokens) auto-detects architecture (including Gemma 4 multimodal), wraps sliding-window caches, and caches rotations to disk. Works with all mlx-community models.
quantize(rotate(x)) ≠ rotate(quantize(x)). CommVQ trains codebooks on pre-RoPE keys and projects each centroid onto the RoPE-commuting subspace (symmetrising paired dims), so RoPE can be applied exactly at decode time.
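The non-commutativity can be seen on a single pair of dimensions. The sketch below is a toy illustration only: it assumes a round-to-nearest-integer quantizer and one 2-D rotation standing in for RoPE, and it does not reproduce CommVQ's codebook training or projection step. The names `rotate` and `quantize` are hypothetical.

```python
import math


def rotate(pair, theta):
    """Rotate one 2-D (paired-dimension) slice by angle theta, as RoPE does."""
    x, y = pair
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def quantize(pair):
    """Toy quantizer: snap each coordinate to the nearest integer."""
    return tuple(float(round(v)) for v in pair)


x = (1.0, 0.0)
theta = math.pi / 4

rotate_then_quantize = quantize(rotate(x, theta))  # rounds (0.707..., 0.707...)
quantize_then_rotate = rotate(quantize(x), theta)  # rotates (1.0, 0.0)
```

Here `rotate_then_quantize` lands on `(1.0, 1.0)` while `quantize_then_rotate` stays near `(0.707, 0.707)`, so the order of the two operations changes the stored key.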
mx.hadamard_transform, O(D log D)
rabitq_hamming_score
New in 0.5.1
The VecInfer hot path is now a 30-line Metal Shading Language shader, JIT-compiled by mx.fast.metal_kernel on first use. Same Python API; the cache auto-detects Metal and dispatches to the fast path.
Benchmarked on Apple Silicon GPU.
[N, n_centroids, sub_dim] diff tensor never gets materialized.
use_metal_kernels=False for parity testing.
7 dedicated parity tests; all 212 tests pass.

    // One thread per sub-vector. Argmin lives in registers; no diff tensor.
    uint vec_idx = thread_position_in_grid.x;
    uint N_total = x_shape[0];
    if (vec_idx >= N_total) { return; }

    uint n_centroids = codebook_shape[0];
    uint sub_dim = codebook_shape[1];
    uint x_base = vec_idx * sub_dim;

    float best_dist = INFINITY;
    uint best_idx = 0;
    for (uint c = 0; c < n_centroids; ++c) {
        uint cb_base = c * sub_dim;
        float dist = 0.0f;
        for (uint i = 0; i < sub_dim; ++i) {
            float d = float(x[x_base + i]) - float(codebook[cb_base + i]);
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best_idx = c;
        }
    }
    out[vec_idx] = best_idx;
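For checking results by hand, the shader's logic can be mirrored in plain Python. This is a minimal sketch and not the library's fallback path: `quantize_vq_reference` is a hypothetical name, and the inputs are ordinary nested lists rather than MLX arrays.

```python
import math


def quantize_vq_reference(x, codebook):
    """Return the index of the nearest centroid for each sub-vector.

    x:        list of N sub-vectors, each of length sub_dim
    codebook: list of n_centroids centroids, each of length sub_dim
    """
    out = []
    for vec in x:  # plays the role of one GPU thread per sub-vector
        best_dist = math.inf
        best_idx = 0
        for c, centroid in enumerate(codebook):
            # Squared Euclidean distance, accumulated as a running scalar
            dist = sum((a - b) ** 2 for a, b in zip(vec, centroid))
            if dist < best_dist:  # strict '<' keeps the first minimum
                best_dist, best_idx = dist, c
        out.append(best_idx)
    return out


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
x = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]
indices = quantize_vq_reference(x, codebook)  # [1, 2, 0]
```

As in the shader, only a running minimum and its index are kept per sub-vector, so no array of size N × n_centroids × sub_dim is ever built.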
Quickstart
mlx_lm in 3 lines
Same mlx_lm.generate API; just pass a compressed cache.
    import mlx_lm
    from veloxquant_mlx import KVCacheBuilder, KVCacheConfig

    model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")

    # 7.5× key cache compression, within 5% of fp16 throughput
    config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=1, seed=42)
    caches = KVCacheBuilder.for_model(model, config)

    response = mlx_lm.generate(
        model,
        tokenizer,
        prompt="Write a 5,000-word analysis of the RLHF literature.",
        max_tokens=5000,
        prompt_cache=caches,
    )
VecInfer:

    import mlx_lm
    from veloxquant_mlx import KVCacheConfig, KVCacheFactory
    from veloxquant_mlx.allocators.vecinfer import calibrate_smooth_factors, train_codebook

    model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-4bit")

    # One-time calibration: run once, cache the results.
    # sample_keys and sample_keys_flat are key activations collected beforehand (not shown).
    smooth = calibrate_smooth_factors(sample_keys)  # [n_heads, head_dim]
    codebook = train_codebook(sample_keys_flat, n_centroids=256, sub_dim=8)

    # 16× key compression; Metal kernel auto-detected for 13× faster quantize
    config = KVCacheConfig(
        method="vecinfer",
        head_dim=128,
        key_codebook_bits=8,     # 256 centroids
        key_sub_dim=8,           # 8 bits per 8-dim sub-vector = 1 bit/elem, 16× compression
        smooth_factors=smooth,
        key_codebook=codebook,
        use_metal_kernels=None,  # None=auto, True=require, False=forbid
    )
    caches = KVCacheFactory.create_for_model(model, config)

    response = mlx_lm.generate(
        model,
        tokenizer,
        prompt="Write a 5,000-word analysis of the RLHF literature.",
        max_tokens=5000,
        prompt_cache=caches,
    )
RateQuant per-layer bit allocation:

    from veloxquant_mlx import (
        KVCacheBuilder,
        KVCacheConfig,
        calibrate_layer_sensitivities,  # 1.6s one-time probe
        allocate_bits_ratequant,        # Theorem 2 reverse-waterfilling
    )

    # Assumes model and tokenizer were already loaded with mlx_lm.load(...)

    # Step 1: probe real activations
    weights = calibrate_layer_sensitivities(model, tokenizer)

    # Step 2: closed-form allocation; average is exact
    alloc = allocate_bits_ratequant(weights, target_avg_bits=1.5, beta=3.5)
    # alloc = [1, 2, 1, 1, 3, 1, 2, ...]  one int per layer

    # Step 3: build per-layer caches
    config = KVCacheConfig(method="turboquant_rvq", bit_width_inlier=alloc)
    caches = KVCacheBuilder.for_model(model, config)
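For intuition about what a reverse water-filling allocation does, here is the textbook version for independent Gaussian components. It is a sketch under stated assumptions and not the library's `allocate_bits_ratequant`: it treats each layer's sensitivity as a variance, returns fractional bits rather than one integer per layer, has no `beta` parameter, and `reverse_waterfill` is a hypothetical name.

```python
import math


def reverse_waterfill(variances, target_avg_bits, iters=100):
    """Textbook reverse water-filling over independent Gaussian components.

    Finds a water level theta such that
        b_i = max(0, 0.5 * log2(var_i / theta))
    averages to target_avg_bits, by bisection on log2(theta).
    """
    def bits(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    n = len(variances)
    hi = math.log2(max(variances))                        # every b_i is 0 here
    lo = math.log2(min(variances)) - 2 * target_avg_bits  # every b_i >= target here
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(bits(2.0 ** mid)) / n > target_avg_bits:
            lo = mid  # too many bits spent: raise the water level
        else:
            hi = mid
    return bits(2.0 ** hi)


# Hypothetical sensitivities for four layers
alloc = reverse_waterfill([4.0, 1.0, 1.0, 0.25], target_avg_bits=1.0)
# alloc is approximately [2.0, 1.0, 1.0, 0.0]
```

The most sensitive component receives the most bits and the least sensitive one falls below the water level and receives none, while the average stays at the target.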
SpectralQuant:

    import mlx_lm
    from veloxquant_mlx.spectral.calibrate import calibrate_spectral_rotation
    from veloxquant_mlx.spectral.spectral_quant import SpectralQuantizer

    model, tokenizer = mlx_lm.load("mlx-community/Qwen2.5-7B-Instruct-4bit")

    # Step 1: one-time calibration (~10s). Rotations are cached to disk.
    tokens = tokenizer.encode("calibration text here", return_tensors="mlx")
    rotations = calibrate_spectral_rotation(
        model,
        tokens,
        model_name="qwen25_7b",  # cached to ~/.cache/veloxquant/
    )

    # Step 2: build per-layer SpectralQuant caches
    caches = []
    for i, layer in enumerate(model.layers):
        key_U, _, _, _, key_ds, _ = rotations[i]
        sq = SpectralQuantizer(
            d=128,
            b_signal=3,
            b_noise=3,
            rotation=key_U,
            d_s=key_ds,
            apply_qjl=False,
        )
        caches.append(sq)  # wrap in mlx_lm-compatible cache

    # Step 3: generate as usual; 5.95× compression, +7–10pp quality vs TurboQuant
    response = mlx_lm.generate(
        model,
        tokenizer,
        prompt="Explain quantum entanglement in simple terms.",
        max_tokens=1000,
        prompt_cache=caches,
    )
Benchmarks
All numbers from end-to-end mlx_lm.generate with a 200-token prompt, 120-token generation. Apple M-series, unified memory.
| Model | FP16 (tok/s) | RVQ-1bit (tok/s) | RVQ compression | VecInfer-1bit (tok/s) | VecInfer compression | Best pick |
|---|---|---|---|---|---|---|
| SmolLM2-135M | 250.4 | 188.5 | 7.1× | 175.8 | 16× | RVQ-1bit |
| Llama-3.2-1B | 105.4 | 104.3 | 7.1× | 91.2 | 16× | RVQ-1bit |
| Llama-3.2-3B | 47.6 | 46.2 | 7.5× | 40.2 | 16× | RVQ-1bit |
| Llama-3.1-8B | 20.5 | 20.6 | 7.5× | 19.6 | 16× | RVQ-1bit |
| Mistral-7B | 23.6 | 22.8 | 7.5× | 9.8 | 16× | RVQ-1bit |
| Qwen2.5-7B | 21.0 | 20.7 | 7.5× | 21.5 ↑ fp16 | 16× | VecInfer-1bit |
| Qwen3-8B | 20.3 | 19.6 | 7.5× | 2.4 | 16× | RVQ-1bit |
| Phi-4 | 10.4 | 8.1 | 7.5× | 4.0 | 16× | TQ-2bit (9.6) |
| Falcon3-7B | 17.3 | 21.7 | 7.8× | 17.0 | 16× | RVQ-1bit |
| gemma-3-4b | 26.0 | 24.2 | 7.8× | 22.6 | 16× | VecInfer-1bit |
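To read the table as relative throughput, divide each compressed figure by its FP16 baseline. The sketch below does this for the RVQ-1bit column, with values copied from the table; the 0.95 cutoff is an illustrative threshold chosen here to match a "within 5% of fp16" reading.

```python
# (model, FP16 tok/s, RVQ-1bit tok/s), copied from the benchmark table
rows = [
    ("SmolLM2-135M", 250.4, 188.5),
    ("Llama-3.2-1B", 105.4, 104.3),
    ("Llama-3.2-3B", 47.6, 46.2),
    ("Llama-3.1-8B", 20.5, 20.6),
    ("Mistral-7B", 23.6, 22.8),
    ("Qwen2.5-7B", 21.0, 20.7),
    ("Qwen3-8B", 20.3, 19.6),
    ("Phi-4", 10.4, 8.1),
    ("Falcon3-7B", 17.3, 21.7),
    ("gemma-3-4b", 26.0, 24.2),
]

# Fraction of FP16 throughput retained by RVQ-1bit, per model
relative = {name: rvq / fp16 for name, fp16, rvq in rows}

# Models that keep at least 95% of FP16 throughput
within_5pct = [name for name, r in relative.items() if r >= 0.95]
```

By this measure 7 of the 10 models stay within 5% of FP16; SmolLM2-135M, Phi-4, and gemma-3-4b fall outside it.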
Decision guide
Installation
Requires Apple Silicon (M1 or later) and Python 3.11+.
pip install VeloxQuant-MLX
pip install "VeloxQuant-MLX[dev]"
git clone https://github.com/rajveer43/VeloxQuant-MLX
cd VeloxQuant-MLX
pip install -e ".[dev]"