What is VeloxQuant-MLX?
VeloxQuant-MLX is a production-grade KV cache compression library for Apple Silicon (M-series Macs). It implements nine quantization algorithms that compress the key-value cache used during LLM inference — reducing peak memory by up to 98% while maintaining near-lossless output quality.
LLMs like Llama, Mistral, and Qwen store past context in a KV cache that grows linearly with sequence length. On an M3 Pro MacBook with 18 GB of unified memory, a 7B model in fp16 already needs roughly 14 GB for its weights, and the KV cache adds several more gigabytes at 8k context (about 4 GB for a Llama-2-style 7B layout), leaving almost no room for anything else. VeloxQuant-MLX compresses that cache on the fly with Metal GPU kernels, making long-context inference practical on consumer hardware.
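The linear growth is easy to check with a back-of-the-envelope estimate. The sketch below is not part of VeloxQuant-MLX: `kv_cache_bytes` is a hypothetical helper, and the layer count, KV head count, and head size are an assumed Llama-2-7B-style layout stored in fp16.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Estimate the size of an uncompressed KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


GIB = 1024 ** 3

# Assumed layout: 32 layers, 32 KV heads, head dimension 128, fp16 elements.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
half = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

print(f"{full / GIB:.1f} GiB at 8k tokens")  # prints "4.0 GiB at 8k tokens"
print(f"{half / GIB:.1f} GiB at 4k tokens")  # prints "2.0 GiB at 4k tokens"
```

Halving the context halves the cache, and the same formula shows why the head layout matters: with 8 KV heads instead of 32 (as in grouped-query attention), the 8k-token estimate drops to 1 GiB. Batch size multiplies the total in the same way.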
Why Apple Silicon?
Apple's M-series chips have a unique advantage: unified memory. The GPU and CPU share the same memory pool, which means there is no PCIe bandwidth bottleneck between host and device. VeloxQuant-MLX is built specifically around this architecture:
- Metal GPU kernels run quantization and dequantization directly on the GPU cores
- MLX — Apple's ML framework — provides the tensor primitives; VeloxQuant-MLX sits on top of it
- Quantized KV cache stays in unified memory, accessed by both the attention kernel and the quantizer with zero copies
Key metrics
| Metric | Value |
|---|---|
| Max key cache compression | 16× (VecInfer 1-bit) |
| Metal kernel speedup | 13× faster quantization |
| Peak memory reduction | up to 98% |
| RVQ-1bit compression | 7.5× with zero calibration |
| RaBitQ full KV | 6× (keys + values) |
| Validated models | 12 (Llama, Mistral, Qwen, Phi, Gemma, Falcon) |
| Test suite | 212+ passing tests |
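The compression ratios in the table translate directly into cache-size savings and an effective bit rate per element. The arithmetic below assumes a 16-bit (fp16) baseline, which the table does not state but which matches 16× for a 1-bit code. The helper names are illustrative, and the sketch does not attempt to reproduce the peak-memory row.

```python
def memory_saved(ratio):
    """Fraction of the original cache size removed by a given compression ratio."""
    return 1.0 - 1.0 / ratio


def effective_bits(ratio, baseline_bits=16):
    """Average bits stored per element, assuming a baseline_bits-wide original."""
    return baseline_bits / ratio


# Ratios copied from the metrics table; the fp16 baseline is an assumption.
rows = [
    ("VecInfer 1-bit (keys)", 16.0),
    ("RVQ-1bit", 7.5),
    ("RaBitQ full KV", 6.0),
]

for name, ratio in rows:
    print(f"{name}: {memory_saved(ratio):.1%} smaller, "
          f"~{effective_bits(ratio):.2f} bits per element")
```

An effective rate above the nominal bit width, such as roughly 2.1 bits per element for a 7.5× ratio, is what you would typically see when per-group scales or other metadata are stored alongside the codes.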
Algorithm overview
VeloxQuant-MLX provides nine algorithms, ranging from zero-calibration 1-bit methods to mixed-precision allocators. The table below covers eight of them:
| Algorithm | Bits | Calibration | Best for |
|---|---|---|---|
| TurboQuant RVQ | 1–3+ | None | General purpose, drop-in replacement |
| VecInfer | 1–4 | Codebook training | Maximum throughput |
| RateQuant | mixed | 90 seconds | Mixed-precision accuracy-memory tradeoffs |
| SpectralQuant | 2–8 | SVD rotation | High-accuracy long context |
| RaBitQ | 1 | None | Key-only extreme compression |
| QJL | 1 | None | Simplest, fastest |
| PolarQuant | 1–2 | None | Geometric key distributions |
| CommVQ | 2–4 | None | RoPE-compatible residual VQ |
See Algorithm Overview for a full comparison.
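To make "zero-calibration 1-bit" and "residual" quantization concrete, here is a minimal pure-Python sketch. It is a generic illustration, not the implementation of TurboQuant RVQ, QJL, RaBitQ, or any other VeloxQuant-MLX algorithm, and every function name is hypothetical. Each stage stores one sign bit per element plus a single scale (the mean absolute value), and later stages quantize whatever error the earlier stages left behind.

```python
import random


def quantize_1bit(vec):
    """Encode a vector as one sign bit per element plus one shared scale."""
    scale = sum(abs(x) for x in vec) / len(vec)
    signs = [1 if x >= 0 else 0 for x in vec]
    return signs, scale


def dequantize_1bit(signs, scale):
    """Decode sign bits back to +scale / -scale."""
    return [scale if bit else -scale for bit in signs]


def residual_quantize(vec, stages):
    """Apply 1-bit quantization repeatedly to the remaining residual."""
    residual = list(vec)
    codes = []
    for _ in range(stages):
        signs, scale = quantize_1bit(residual)
        codes.append((signs, scale))
        approx = dequantize_1bit(signs, scale)
        residual = [r - a for r, a in zip(residual, approx)]
    return codes


def reconstruct(codes):
    """Sum the decoded stages to approximate the original vector."""
    out = [0.0] * len(codes[0][0])
    for signs, scale in codes:
        for i, value in enumerate(dequantize_1bit(signs, scale)):
            out[i] += value
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one key vector

# Reconstruction error after 1, 2, and 3 residual stages (1, 2, and 3 bits per element).
errors = [squared_error(vec, reconstruct(residual_quantize(vec, s))) for s in (1, 2, 3)]
for stages, err in zip((1, 2, 3), errors):
    print(f"{stages} stage(s): squared error {err:.3f}")
```

No calibration data is needed because the scale is computed from the vector itself. Each added stage costs one more bit per element and lowers the reconstruction error, which is the basic accuracy-versus-memory tradeoff behind the bit ranges in the table.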