Algorithm Overview
VeloxQuant-MLX implements nine KV cache compression algorithms. This page helps you pick the right one for your workload.
:::warning Apple Silicon required
All algorithms use Metal GPU kernels and require macOS on an M-series chip.
:::
Comparison table
| Algorithm | Key bits | Val bits | Calibration | Compression | Quality | Best for |
|---|---|---|---|---|---|---|
| TurboQuant RVQ | 1–3 | 2–4 | None | 7.5× | ★★★★ | General purpose, zero setup |
| VecInfer | 1–4 | 2–4 | Codebook (2 min) | 16× | ★★★★ | Max throughput, Metal-accelerated |
| RateQuant | mixed | mixed | Sensitivity (90 s) | 6–12× | ★★★★★ | Best accuracy per bit |
| SpectralQuant | 2–8 | 2–4 | SVD rotation (3 min) | 4–8× | ★★★★★ | Long context, high fidelity |
| RaBitQ | 1 | fp16 | None | 6× total | ★★★ | Key-only extreme compression |
| QJL | 1 | fp16 | None | 8× key only | ★★★ | Simplest, fastest to set up |
| PolarQuant | 1–2 | 2 | None | 8× | ★★★ | Geometric key distributions |
| CommVQ | 2–4 | fp16 | None | 4–8× | ★★★★ | RoPE-compatible models |
Compression ratios were measured on Llama-3.1-8B at a 4096-token context. Source: BENCHMARK_RESULTS.md.
Decision guide
Do you want zero calibration?
├── Yes → TurboQuant RVQ (best quality), QJL (simplest), RaBitQ (1-bit keys)
└── No, I can spend 1–3 minutes calibrating →
    ├── Priority: max compression → VecInfer
    ├── Priority: max quality → RateQuant or SpectralQuant
    └── Long sequences (8k+) → SpectralQuant

Is RoPE positional encoding compatibility critical?
└── Yes → CommVQ

Do you have geometric/non-Gaussian key distributions?
└── Yes → PolarQuant
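The same guide can be written as a small lookup function, which is handy when the choice is made in a script. This is only a sketch: `recommend` and its parameters are hypothetical names, not part of VeloxQuant-MLX, and the precedence between branches (long context is checked before priority, and the RoPE and key-distribution answers are appended afterwards) is an assumption, since the guide does not rank them.

```python
def recommend(calibration_ok=False, priority="quality", long_context=False,
              rope_critical=False, non_gaussian_keys=False):
    """Return candidate algorithm names following the decision guide.

    Hypothetical helper; branch precedence is an assumption.
    """
    if not calibration_ok:
        # Zero-calibration branch: best quality, simplest, 1-bit keys.
        picks = ["TurboQuant RVQ", "QJL", "RaBitQ"]
    elif long_context:
        # Long sequences (8k+ tokens).
        picks = ["SpectralQuant"]
    elif priority == "compression":
        picks = ["VecInfer"]
    else:
        # Priority: max quality.
        picks = ["RateQuant", "SpectralQuant"]

    # The two remaining questions are independent of calibration.
    if rope_critical and "CommVQ" not in picks:
        picks.append("CommVQ")
    if non_gaussian_keys and "PolarQuant" not in picks:
        picks.append("PolarQuant")
    return picks


print(recommend(calibration_ok=True, priority="compression"))
```

The function returns a list rather than a single name because several branches of the guide offer more than one option.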
Method families
Zero-calibration methods
These work immediately on any model with no setup beyond installation.
- TurboQuant RVQ — The recommended default. Uses analytical Gaussian + Laplacian codebooks precomputed from distribution theory. Two residual passes give excellent fidelity at 1 bit per pass.
- QJL — Johnson-Lindenstrauss 1-bit sign sketch. Provably preserves inner products in expectation. Extremely simple — great for prototyping.
- RaBitQ — Randomised Hadamard transform + 1-bit sign packing with IVF clustering. Better than QJL for key-only compression.
- PolarQuant — Recursive polar decomposition for models where keys form geometric clusters.
- CommVQ — RoPE-commutative residual VQ: quantization that commutes with rotary position embeddings, preserving exact positional information.
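To make the QJL bullet concrete, here is a minimal pure-Python sketch of a 1-bit sign sketch and its inner-product estimator. It assumes the standard construction: a shared Gaussian projection matrix, each key stored as sign bits plus its norm, and the query left unquantized. The function names are hypothetical, and this is an illustration of the idea rather than the library's Metal implementation.

```python
import math
import random


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def make_projection(m, d, seed=0):
    """m x d matrix of i.i.d. standard normal entries, shared by keys and queries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def qjl_encode(key, proj):
    """Compress a key to one sign per projection row, plus its Euclidean norm."""
    signs = [1 if dot(row, key) >= 0.0 else -1 for row in proj]
    return signs, math.sqrt(dot(key, key))


def qjl_inner_product(query, signs, key_norm, proj):
    """Estimate <query, key> from the stored signs and norm.

    For a Gaussian row s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by sqrt(pi/2) * ||k|| makes the average unbiased.
    """
    acc = sum(dot(row, query) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / len(proj)


# Small demo: the estimate fluctuates around the exact value.
proj = make_projection(m=512, d=8, seed=0)
key = [0.5, -1.0, 0.25, 2.0, 0.0, -0.75, 1.5, 0.1]
query = [1.0, 0.5, -0.5, 1.0, 0.2, 0.0, -1.0, 0.3]
signs, norm = qjl_encode(key, proj)
print("exact:   ", dot(query, key))
print("estimate:", qjl_inner_product(query, signs, norm, proj))
```

The estimator's error shrinks roughly as one over the square root of the number of projection rows, which is the trade-off a sketch dimension controls.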
Calibration-required methods
These require a one-time calibration step (roughly 90 seconds to 3 minutes, per the comparison table) in exchange for higher compression or better accuracy per bit.
- VecInfer — Product VQ with Metal-accelerated codebook lookup. Smooth scaling handles outlier dimensions. The fastest method at inference time due to fused SDPA kernels.
- RateQuant — Mixed-precision allocation via reverse-waterfilling. Probes per-layer sensitivity and allocates more bits to layers that contribute most to output quality. Best accuracy per average bit.
- SpectralQuant — SVD rotation aligns key dimensions with high-variance directions. Separate signal/noise codebooks. Best for very long contexts (8k+).
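The bit-allocation idea behind RateQuant can be illustrated with textbook reverse water-filling. The sketch below assumes each layer has a positive sensitivity score that plays the role of a source variance, and it solves for the water level by bisection so that the mean allocation matches a target. The name `reverse_waterfill` and its inputs are hypothetical; how the library probes sensitivity and rounds to supported bit widths is not shown here.

```python
import math


def reverse_waterfill(sensitivities, avg_bits, iters=60):
    """Fractional bits per layer whose mean is `avg_bits`.

    Layer i receives max(0, 0.5 * log2(s_i / theta)); theta is the water level.
    """
    if avg_bits < 0 or any(s <= 0 for s in sensitivities):
        raise ValueError("need avg_bits >= 0 and positive sensitivities")

    logs = [math.log2(s) for s in sensitivities]
    budget = avg_bits * len(logs)

    def alloc(level):
        # `level` is log2(theta); layers below the water level get zero bits.
        return [max(0.0, 0.5 * (v - level)) for v in logs]

    # Total bits fall as the level rises: bracket the solution, then bisect.
    lo, hi = min(logs) - 2.0 * budget, max(logs)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return alloc(hi)


print(reverse_waterfill([16.0, 4.0, 1.0, 1.0], avg_bits=2.0))
```

With sensitivities 16, 4, 1, 1 and a 2-bit average, the allocation comes out at roughly 3.25, 2.25, 1.25 and 1.25 bits: the most sensitive layer gets the most bits while the total stays on budget.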
Mixing methods
The CompositeQuantizer chains multiple quantizers in sequence:
from veloxquant_mlx.quantizers.composite import CompositeQuantizer
from veloxquant_mlx.quantizers.turboquant_rvq import TurboQuantRVQ
from veloxquant_mlx.quantizers.qjl import QJLQuantizer

# RVQ for first-pass compression + QJL residual sketch
quantizer = CompositeQuantizer([
    TurboQuantRVQ(bits=1),
    QJLQuantizer(sketch_dim=64),
])
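If "chains in sequence" means that each stage encodes the residual left by the stages before it (the "residual sketch" comment suggests this, but it is an assumption here), the mechanism can be sketched with toy uniform quantizers standing in for the real ones. `make_uniform_quantizer` and `chain_quantize` are hypothetical names used only for illustration.

```python
def make_uniform_quantizer(step):
    """Toy stand-in for a real quantizer: round each value to a multiple of `step`."""
    def quantize(vec):
        return [step * round(x / step) for x in vec]
    return quantize


def chain_quantize(vec, stages):
    """Apply stages in order, each one approximating what is still unexplained."""
    recon = [0.0] * len(vec)
    residual = list(vec)
    for stage in stages:
        approx = stage(residual)
        recon = [r + a for r, a in zip(recon, approx)]
        # The next stage only sees the error that remains.
        residual = [x - r for x, r in zip(vec, recon)]
    return recon


vec = [0.33, -1.27, 2.0, 0.04]
coarse = chain_quantize(vec, [make_uniform_quantizer(0.5)])
refined = chain_quantize(vec, [make_uniform_quantizer(0.5),
                               make_uniform_quantizer(0.1)])
print("one stage :", coarse)
print("two stages:", refined)
```

The second stage never needs to represent the full dynamic range of the input, only the first stage's error, which is why a cheap second pass can still tighten the reconstruction.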
Per-model recommendations
| Model | Recommended algorithm | Notes |
|---|---|---|
| Llama 3.1/3.2 (7–8B) | TurboQuant RVQ 1-bit | Gaussian key distribution, zero setup |
| Mistral 7B / Mixtral | VecInfer 2-bit | Sliding window attention benefits from product VQ |
| Qwen 2.5 (7–14B) | SpectralQuant | Long-context optimised, benefits from SVD rotation |
| Phi-3 Mini | RaBitQ + CommVQ | Small head dim, CommVQ preserves RoPE exactly |
| Gemma 2B/7B | TurboQuant RVQ 2-bit | GQA benefits from slightly higher bit rate |
| Falcon 7B | RateQuant | ALiBi positional bias; RateQuant adapts per-layer |
Next steps
- Pick an algorithm and read its detailed page
- mlx_lm integration guide
- Calibration guide