What is VeloxQuant-MLX?

VeloxQuant-MLX is a production-grade KV cache compression library for Apple Silicon (M-series Macs). It implements nine quantization algorithms that compress the key-value cache used during LLM inference — reducing peak memory by up to 98% while maintaining near-lossless output quality.

LLMs like Llama, Mistral, and Qwen store past context in a KV cache that grows linearly with sequence length. On a MacBook M3 Pro with 18 GB unified memory, the FP16 weights of a 7B model already take about 14 GB, and at 8k context the KV cache can add several more gigabytes (roughly 4 GB for a Llama-2-7B-style model with an FP16 cache), leaving no room for anything else. VeloxQuant-MLX compresses that cache on the fly with Metal GPU kernels, making long-context inference practical on consumer hardware.
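
The linear growth is easy to estimate. The sketch below computes per-sequence cache size from the model shape. The defaults (32 layers, 32 KV heads, head dimension 128, FP16 elements) are assumptions chosen to resemble a Llama-2-7B-style model, not figures from VeloxQuant-MLX, and `kv_cache_bytes` is a hypothetical helper rather than part of the library.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    Defaults are an assumed Llama-2-7B-style shape with an FP16 cache.
    """
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token


GIB = 1024 ** 3
for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so doubling the context doubles the cache, and a larger batch multiplies it again. Models that use grouped-query attention have fewer KV heads and shrink proportionally.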

Why Apple Silicon?

Apple's M-series chips have a unique advantage: unified memory. The GPU and CPU share the same memory pool, which means there is no PCIe bandwidth bottleneck between host and device. VeloxQuant-MLX is built specifically around this architecture:

  • Metal GPU kernels run quantization/dequantization directly on the GPU cores
  • MLX — Apple's ML framework — provides the tensor primitives; VeloxQuant-MLX sits on top of it
  • Quantized KV cache stays in unified memory, accessed by both the attention kernel and the quantizer with zero copies

Key metrics

| Metric | Value |
|---|---|
| Max key cache compression | 16× (VecInfer 1-bit) |
| Metal kernel speedup | 13× faster quantization |
| Peak memory reduction | up to 98% |
| RVQ-1bit compression | 7.5× with zero calibration |
| RaBitQ full KV | 6× (keys + values) |
| Validated models | 12 (Llama, Mistral, Qwen, Phi, Gemma, Falcon) |
| Test suite | 212+ passing tests |
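
Compression ratios and percentage reductions are two views of the same number. As a quick conversion aid, the sketch below relates bit width, ratio, and percent saved. The helper names are hypothetical, and the FP16 baseline and the per-group metadata model are assumptions for illustration, not the library's own accounting.

```python
def reduction_percent(ratio):
    """Memory saved, in percent, for a given compression ratio (e.g. 6 for 6x)."""
    return 100.0 * (1.0 - 1.0 / ratio)


def effective_ratio(bits, group_size=None, overhead_bits=0, baseline_bits=16):
    """Compression ratio versus an assumed FP16 baseline.

    `overhead_bits` models per-group metadata (for example one scale factor)
    amortized over `group_size` elements; without a group size it is ignored.
    """
    per_elem = bits + (overhead_bits / group_size if group_size else 0.0)
    return baseline_bits / per_elem


print(effective_ratio(1))                                    # 1-bit codes, no metadata
print(effective_ratio(1, group_size=64, overhead_bits=16))   # plus one FP16 scale per 64 elements
print(f"{reduction_percent(6):.1f}%")                        # percent saved at a 6x ratio
```

A pure 1-bit code gives 16× against FP16, which is consistent with the VecInfer 1-bit figure. In this model, storing one FP16 scale per 64 elements would lower that to 12.8×, which is why per-group metadata matters at very low bit widths.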

Algorithm overview

VeloxQuant-MLX provides nine algorithms, ranging from zero-calibration 1-bit methods to mixed-precision allocators. The table summarizes eight of them:

| Algorithm | Bits | Calibration | Best for |
|---|---|---|---|
| TurboQuant RVQ | 1–3+ | None | General purpose, drop-in replacement |
| VecInfer | 1–4 | Codebook training | Maximum throughput |
| RateQuant | mixed | 90 seconds | Mixed-precision accuracy-memory tradeoffs |
| SpectralQuant | 2–8 | SVD rotation | High-accuracy long context |
| RaBitQ | 1 | None | Key-only extreme compression |
| QJL | 1 | None | Simplest, fastest |
| PolarQuant | 1–2 | None | Geometric key distributions |
| CommVQ | 2–4 | None | RoPE-compatible residual VQ |
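
CommVQ is described above as a residual VQ, and the "RVQ" in TurboQuant RVQ suggests the same family. In residual quantization, each stage encodes the error left by the previous one. The following is a minimal, generic sketch of that idea using 1-bit sign codes with one scale per stage. It is not the TurboQuant RVQ or CommVQ implementation, and `residual_sign_quantize`, `reconstruct`, and `mse` are hypothetical names introduced only for illustration.

```python
import random


def residual_sign_quantize(x, stages):
    """Encode x as a sum of `stages` terms, each a scale times a +/-1 pattern."""
    residual = list(x)
    codes = []
    for _ in range(stages):
        # For codes equal to sign(r), the mean of |r| is the scale
        # that minimizes the squared error of this stage.
        scale = sum(abs(r) for r in residual) / len(residual)
        signs = [1 if r >= 0 else -1 for r in residual]
        codes.append((scale, signs))
        # The next stage quantizes whatever this stage failed to capture.
        residual = [r - scale * s for r, s in zip(residual, signs)]
    return codes


def reconstruct(codes):
    """Sum the per-stage contributions back into an approximation of x."""
    out = [0.0] * len(codes[0][1])
    for scale, signs in codes:
        for i, s in enumerate(signs):
            out[i] += scale * s
    return out


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


random.seed(0)
key = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one key vector
errors = [mse(key, reconstruct(residual_sign_quantize(key, s))) for s in (1, 2, 3)]
print(errors)  # reconstruction error after 1, 2, and 3 stages
```

Each extra stage adds one bit per element plus one scale, and the reconstruction error shrinks with every stage. That accuracy-for-bits trade-off is one way to read a range such as 1–3+ bits.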

See Algorithm Overview for a full comparison.

Next steps