Algorithm Overview

VeloxQuant-MLX implements nine KV cache compression algorithms. This page helps you pick the right one for your workload.

:::warning Apple Silicon required
All algorithms use Metal GPU kernels and require macOS on an M-series chip.
:::

Comparison table

| Algorithm | Key bits | Value bits | Calibration | Compression | Quality | Best for |
|---|---|---|---|---|---|---|
| TurboQuant RVQ | 1–3 | 2–4 | None | 7.5× | ★★★★ | General purpose, zero setup |
| VecInfer | 1–4 | 2–4 | Codebook (2 min) | 16× | ★★★★ | Max throughput, Metal-accelerated |
| RateQuant | mixed | mixed | Sensitivity (90 s) | 6–12× | ★★★★★ | Best accuracy per bit |
| SpectralQuant | 2–8 | 2–4 | SVD rotation (3 min) | 4–8× | ★★★★★ | Long context, high fidelity |
| RaBitQ | 1 | fp16 | None | 6× total | ★★★ | Key-only extreme compression |
| QJL | 1 | fp16 | None | 8× key only | ★★★ | Simplest, fastest to set up |
| PolarQuant | 1–2 | 2 | None | | ★★★ | Geometric key distributions |
| CommVQ | 2–4 | fp16 | None | 4–8× | ★★★★ | RoPE-compatible models |

Compression ratios were measured on Llama-3.1-8B at a 4096-token context. Source: BENCHMARK_RESULTS.md.

Decision guide

```text
Do you want zero calibration?
├── Yes → TurboQuant RVQ (best quality), QJL (simplest), RaBitQ (1-bit keys)
└── No, I can spend 1–3 minutes calibrating →
    ├── Priority: max compression → VecInfer
    ├── Priority: max quality → RateQuant or SpectralQuant
    └── Long sequences (8k+) → SpectralQuant

Is RoPE positional encoding compatibility critical?
└── Yes → CommVQ

Do you have geometric/non-Gaussian key distributions?
└── Yes → PolarQuant
```
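
The decision guide can also be written as a small helper function. This is a minimal sketch, not part of VeloxQuant-MLX: the function name `suggest_algorithms` and its parameters are hypothetical, and the order in which the three questions are checked (RoPE first, then key geometry, then calibration) is an assumption, since the guide does not rank them.

```python
def suggest_algorithms(can_calibrate, priority="quality", long_context=False,
                       rope_critical=False, geometric_keys=False):
    """Return candidate algorithm names following the decision guide.

    Hypothetical helper. The precedence between the three questions
    is an assumption made for this sketch.
    """
    if rope_critical:
        return ["CommVQ"]
    if geometric_keys:
        return ["PolarQuant"]
    if not can_calibrate:
        # Zero-calibration branch, listed in the guide's order.
        return ["TurboQuant RVQ", "QJL", "RaBitQ"]
    if long_context:
        # Long sequences (8k+ tokens).
        return ["SpectralQuant"]
    if priority == "compression":
        return ["VecInfer"]
    # Default priority for the calibrated branch: max quality.
    return ["RateQuant", "SpectralQuant"]


print(suggest_algorithms(can_calibrate=False))
print(suggest_algorithms(can_calibrate=True, priority="compression"))
```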

Method families

Zero-calibration methods

These work immediately on any model with no setup beyond installation.

  • TurboQuant RVQ — The recommended default. Uses analytical Gaussian + Laplacian codebooks precomputed from distribution theory. Two residual passes give excellent fidelity at 1 bit per pass.
  • QJL — Johnson-Lindenstrauss 1-bit sign sketch. Provably preserves inner products in expectation. Extremely simple — great for prototyping.
  • RaBitQ — Randomised Hadamard transform + 1-bit sign packing with IVF clustering. Better than QJL for key-only compression.
  • PolarQuant — Recursive polar decomposition for models where keys form geometric clusters.
  • CommVQ — RoPE-commutative residual VQ: quantization that commutes with rotary position embeddings, preserving exact positional information.
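
To make the QJL claim concrete, the sketch below shows how a 1-bit sign sketch can still give an unbiased inner-product estimate. It is a pure-Python illustration of the general technique, not the library's Metal implementation: the function names are hypothetical, the projection is a plain Gaussian matrix, and the sketch size `m = 4096` is deliberately large so the estimate is stable in a demo. It relies on the identity E[⟨s, q⟩ · sign(⟨s, k⟩)] = √(2/π) · ⟨q, k⟩ / ‖k‖ for a standard Gaussian vector s.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def qjl_encode(key, proj):
    """Keep one sign bit per projection row, plus the key's norm."""
    bits = [1 if dot(row, key) >= 0 else -1 for row in proj]
    return bits, math.sqrt(dot(key, key))


def qjl_inner_product(query, bits, key_norm, proj):
    """Estimate <query, key> from sign bits; the query stays unquantized."""
    acc = sum(dot(row, query) * b for row, b in zip(proj, bits))
    return math.sqrt(math.pi / 2.0) * key_norm * acc / len(proj)


rng = random.Random(0)
d, m = 16, 4096  # head dimension and sketch size (demo values, assumed)
proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
key = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
query = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])

bits, key_norm = qjl_encode(key, proj)
exact = dot(query, key)
estimate = qjl_inner_product(query, bits, key_norm, proj)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The estimator's standard deviation shrinks roughly as 1/√m, so smaller sketches trade accuracy for memory.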

Calibration-required methods

These require a one-time calibration step, but deliver significantly better accuracy per bit.

  • VecInfer — Product VQ with Metal-accelerated codebook lookup. Smooth scaling handles outlier dimensions. The fastest method at inference time due to fused SDPA kernels.
  • RateQuant — Mixed-precision allocation via reverse-waterfilling. Probes per-layer sensitivity and allocates more bits to layers that contribute most to output quality. Best accuracy per average bit.
  • SpectralQuant — SVD rotation aligns key dimensions with high-variance directions. Separate signal/noise codebooks. Best for very long contexts (8k+).
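
The reverse-waterfilling idea behind RateQuant can be sketched in a few lines. The example assumes each layer's sensitivity can be treated like the variance of a Gaussian source, so a layer with sensitivity s receives max(0, ½·log₂(s/θ)) bits, where the "water level" θ is chosen so the rates sum to the budget. The function name, the sensitivity values, and the budget are all hypothetical, and a real allocator would also have to round to supported bit widths, which this sketch does not do.

```python
import math


def reverse_waterfill(sensitivities, total_bits, iters=100):
    """Return fractional per-layer bit rates summing to about total_bits.

    sensitivities must be positive. Layers whose sensitivity falls
    below the water level theta receive zero bits.
    """
    def rates(theta):
        return [max(0.0, 0.5 * math.log2(s / theta)) for s in sensitivities]

    lo, hi = 1e-12, max(sensitivities)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection in log space
        if sum(rates(mid)) > total_bits:
            lo = mid  # spending too many bits: raise the water level
        else:
            hi = mid
    return rates(hi)


# Hypothetical per-layer sensitivities and a 4-bit total budget.
layer_sensitivity = [4.0, 1.0, 0.25, 0.01]
rates = reverse_waterfill(layer_sensitivity, total_bits=4.0)
for s, r in zip(layer_sensitivity, rates):
    print(f"sensitivity={s:<5} bits={r:.3f}")
```

More sensitive layers end up with more bits, and the least sensitive layer in this example drops below the water level and gets none.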

Mixing methods

The CompositeQuantizer chains multiple quantizers in sequence:

```python
from veloxquant_mlx.quantizers.composite import CompositeQuantizer
from veloxquant_mlx.quantizers.turboquant_rvq import TurboQuantRVQ
from veloxquant_mlx.quantizers.qjl import QJLQuantizer

# RVQ for first-pass compression + QJL residual sketch
quantizer = CompositeQuantizer([
    TurboQuantRVQ(bits=1),
    QJLQuantizer(sketch_dim=64),
])
```
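
One way to read "chains in sequence" is that each stage encodes the residual left by the stages before it. The toy below illustrates that idea in pure Python with a 1-bit sign quantizer. It is an assumption about the chaining semantics, not a description of `CompositeQuantizer` internals, and `sign_quantize` and `residual_chain` are hypothetical names.

```python
import random


def sign_quantize(x):
    """1-bit quantizer: keep the signs plus one shared scale, mean(|x|)."""
    scale = sum(abs(v) for v in x) / len(x)
    return [scale if v >= 0 else -scale for v in x]


def residual_chain(x, stages):
    """Apply quantizers in order; each one codes the residual so far."""
    approx = [0.0] * len(x)
    residual = list(x)
    for quantize in stages:
        step = quantize(residual)
        approx = [a + s for a, s in zip(approx, step)]
        residual = [r - s for r, s in zip(residual, step)]
    return approx


def sq_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]

one_pass = residual_chain(x, [sign_quantize])
two_pass = residual_chain(x, [sign_quantize, sign_quantize])
err_one = sq_error(x, one_pass)
err_two = sq_error(x, two_pass)
print(f"one pass: {err_one:.3f}  two passes: {err_two:.3f}")
```

Each added stage can only lower the squared error in this toy, because the mean-absolute-value scale is the least-squares optimal scale for a sign code. The same reasoning motivates residual schemes such as two 1-bit passes.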

Per-model recommendations

| Model | Recommended algorithm | Notes |
|---|---|---|
| Llama 3.1/3.2 (7–8B) | TurboQuant RVQ 1-bit | Gaussian key distribution, zero setup |
| Mistral 7B / Mixtral | VecInfer 2-bit | Sliding window attention benefits from product VQ |
| Qwen 2.5 (7–14B) | SpectralQuant | Long-context optimised, benefits from SVD rotation |
| Phi-3 Mini | RaBitQ + CommVQ | Small head dim, CommVQ preserves RoPE exactly |
| Gemma 2B/7B | TurboQuant RVQ 2-bit | GQA benefits from slightly higher bit rate |
| Falcon 7B | RateQuant | ALiBi positional bias; RateQuant adapts per-layer |

Next steps