Cache API

veloxquant_mlx.cache

The cache module provides the configuration system, factory, builder, and all KV cache implementations.


KVCacheConfig

from veloxquant_mlx.cache.base import KVCacheConfig

Dataclass that describes a quantization configuration.

@dataclass
class KVCacheConfig:
    method: str
    bits: int = 1
    value_bits: int = 2
    num_residuals: int = 2
    use_hadamard: bool = True
    codebook: ndarray | None = None
    smooth_factors: ndarray | None = None
    rotations: list | None = None
    bit_allocation: dict[str, int] | None = None
    outlier_observer: KeyNormObserver | None = None
    outlier_bits: int = 8
    sketch_dim: int = 64
    num_clusters: int = 64
    num_subspaces: int | None = None
    use_fused_sdpa: bool = True
    signal_bits: int = 4
    noise_bits: int = 1
    seed: int = 0

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| method | str | Required | Algorithm name: "turboquant_rvq", "vecinfer", "ratequant", "spectral", "rabitq", "qjl", "polarquant", "commvq" |
| bits | int | 1 | Key bit rate |
| value_bits | int | 2 | Value bit rate. 16 = fp16 (no compression) |
| num_residuals | int | 2 | RVQ residual passes (TurboQuant RVQ only) |
| use_hadamard | bool | True | Apply Walsh-Hadamard before quantization |
| codebook | ndarray | None | Trained product codebook (VecInfer required) |
| smooth_factors | ndarray | None | Per-channel scaling (VecInfer required) |
| rotations | list | None | SVD rotations (SpectralQuant required) |
| bit_allocation | dict | None | Per-layer bit map (RateQuant) |
| sketch_dim | int | 64 | JL sketch dimension (QJL) |
| num_clusters | int | 64 | IVF clusters (RaBitQ) |
| signal_bits | int | 4 | Bits for signal dimensions (SpectralQuant) |
| noise_bits | int | 1 | Bits for noise dimensions (SpectralQuant) |
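
The table marks a few fields as required for specific methods. A minimal sketch of a pre-flight check follows. The helper `missing_required_fields` and the `REQUIRED_FIELDS` mapping are hypothetical and not part of veloxquant_mlx; the mapping only encodes the "required" notes in the table, and the library may validate its configuration differently.

```python
# Hypothetical pre-flight check; not part of veloxquant_mlx.
# The mapping only encodes the "required" notes from the parameter table.
REQUIRED_FIELDS = {
    "vecinfer": ("codebook", "smooth_factors"),
    "spectral": ("rotations",),
}


def missing_required_fields(method: str, **fields) -> list[str]:
    """Return the names of required fields that are absent or None."""
    required = REQUIRED_FIELDS.get(method, ())
    return [name for name in required if fields.get(name) is None]


# VecInfer without calibration artifacts: both fields are reported.
print(missing_required_fields("vecinfer"))
# SpectralQuant with rotations supplied: nothing is missing.
print(missing_required_fields("spectral", rotations=[[1.0, 0.0], [0.0, 1.0]]))
# Methods with no required artifacts in the table pass trivially.
print(missing_required_fields("turboquant_rvq", bits=1))
```

Running a check like this before building a cache surfaces a missing codebook or rotation list early instead of at the first attention step.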

KVCacheFactory

from veloxquant_mlx.cache.base import KVCacheFactory

Factory that maps a KVCacheConfig to a concrete KVCache instance.

KVCacheFactory.create

@staticmethod
def create(
    config: KVCacheConfig,
    num_heads: int,
    head_dim: int,
    max_seq_len: int = 8192,
) -> KVCache

Creates a single-layer KV cache.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| config | KVCacheConfig | Quantization configuration |
| num_heads | int | Number of KV heads for this layer |
| head_dim | int | Dimension per attention head |
| max_seq_len | int | Pre-allocated sequence length |

Returns: A concrete KVCache subclass matching config.method.
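
Selecting a class from `config.method` is a standard registry pattern. The sketch below illustrates that pattern with stand-in classes (`ToyConfig`, `ToyCache`, `ToyRVQCache`, `ToyQJLCache`) and a hypothetical `_REGISTRY` dict. None of these names come from veloxquant_mlx, and the `(num_heads, max_seq_len, head_dim)` shape is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Stand-in for KVCacheConfig with only the fields needed here."""
    method: str
    bits: int = 1


class ToyCache:
    """Stand-in base class; the layout of `shape` is an assumption."""

    def __init__(self, config, num_heads, head_dim, max_seq_len):
        self.config = config
        self.shape = (num_heads, max_seq_len, head_dim)


class ToyRVQCache(ToyCache):
    pass


class ToyQJLCache(ToyCache):
    pass


# Hypothetical registry: method name -> cache class.
_REGISTRY = {"turboquant_rvq": ToyRVQCache, "qjl": ToyQJLCache}


def create(config, num_heads, head_dim, max_seq_len=8192):
    """Look up the class for config.method and instantiate it."""
    try:
        cache_cls = _REGISTRY[config.method]
    except KeyError:
        raise ValueError(f"unknown method: {config.method!r}") from None
    return cache_cls(config, num_heads, head_dim, max_seq_len)


cache = create(ToyConfig(method="qjl"), num_heads=8, head_dim=128)
print(type(cache).__name__, cache.shape)
```

Keeping the mapping in one dict means an unknown method name fails in a single place with a clear error.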


KVCacheBuilder

from veloxquant_mlx.cache.base import KVCacheBuilder

High-level builder that inspects a model and creates per-layer caches automatically.

KVCacheBuilder.build

@staticmethod
def build(
    model,
    config: KVCacheConfig,
    max_seq_len: int = 8192,
) -> list[KVCache]

Creates one KVCache per transformer layer, matching layer-specific head counts and head dims.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| model | mlx_lm model | Model loaded with mlx_lm.load() |
| config | KVCacheConfig | Quantization configuration |
| max_seq_len | int | Pre-allocated sequence length per cache |

Returns: list[KVCache], one cache per layer. Pass the list directly to mlx_lm.generate(kv_cache=...).

Example:

import mlx_lm
from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")
config = KVCacheConfig(method="turboquant_rvq", bits=1)
cache = KVCacheBuilder.build(model, config)
# cache is a list of 28 TurboQuantRVQKVCache instances (one per Llama layer)
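
For a rough sense of what bits=1 and value_bits=2 save, the payload size can be estimated from the cache dimensions. This is a back-of-the-envelope sketch under stated assumptions: the 28 layers come from the example above, while 8 KV heads and head_dim=128 are assumed values that should be checked against the model's own config. The estimate counts only the quantized key and value elements. It ignores scales, codebooks, extra residual passes, and any other metadata the library stores, so real usage will be higher. `payload_bytes` is a hypothetical helper.

```python
def payload_bytes(num_layers, num_kv_heads, head_dim, seq_len, key_bits, value_bits):
    """Bytes for quantized keys and values only, with no metadata."""
    elements = num_layers * num_kv_heads * head_dim * seq_len
    return elements * (key_bits + value_bits) / 8


# 28 layers comes from the example above; 8 KV heads and head_dim=128 are
# assumed values, so check them against the model's own config.
shape = dict(num_layers=28, num_kv_heads=8, head_dim=128, seq_len=8192)

quantized = payload_bytes(**shape, key_bits=1, value_bits=2)
fp16 = payload_bytes(**shape, key_bits=16, value_bits=16)

print(f"quantized: {quantized / 2**20:.1f} MiB")
print(f"fp16:      {fp16 / 2**20:.1f} MiB")
print(f"ratio:     {fp16 / quantized:.2f}x")
```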

Cache classes

TurboQuantRVQKVCache

from veloxquant_mlx.cache.turboquant_rvq_cache import TurboQuantRVQKVCache

KV cache backed by TurboQuant RVQ. Writes compressed keys/values on each attention step and provides dequantized tensors for attention computation.
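
The idea behind residual passes (num_residuals) can be shown with a generic residual quantizer: each pass encodes whatever the previous passes failed to capture. The sketch below uses a plain 1-bit sign code with one scale per pass. It is a stand-in for illustration and not the TurboQuant RVQ algorithm itself; `rvq_encode`, `rvq_decode`, and `sq_error` are hypothetical names.

```python
def rvq_encode(vec, num_residuals=2):
    """Quantize vec in passes; each pass encodes what the previous ones missed."""
    residual = list(vec)
    stages = []
    for _ in range(num_residuals):
        # 1-bit code per element (the sign) plus one shared scale per pass.
        scale = sum(abs(x) for x in residual) / len(residual)
        signs = [1 if x >= 0 else -1 for x in residual]
        stages.append((scale, signs))
        residual = [x - scale * s for x, s in zip(residual, signs)]
    return stages


def rvq_decode(stages):
    """Sum the per-pass reconstructions."""
    out = [0.0] * len(stages[0][1])
    for scale, signs in stages:
        out = [o + scale * s for o, s in zip(out, signs)]
    return out


def sq_error(a, b):
    """Squared reconstruction error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


key = [0.9, -0.3, 0.1, -1.2, 0.4, 0.05, -0.6, 0.8]
for passes in (1, 2, 3):
    approx = rvq_decode(rvq_encode(key, passes))
    print(passes, round(sq_error(key, approx), 4))
```

With the mean absolute value as the scale, each pass lowers the squared error as long as the residual is non-zero, at the cost of one more bit per element.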

VecInferKVCache

from veloxquant_mlx.cache.vecinfer_cache import VecInferKVCache

VecInfer cache that combines smooth scaling with product VQ. Requires a pre-trained codebook and smooth factors.
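
As a rough illustration of how per-channel scaling and a product codebook fit together, the sketch below scales a key, splits it into subspaces, and stores one centroid index per subspace. The codebook and smooth factors are made-up toy values, and dividing each channel by its factor is an assumption about the direction of the scaling. `nearest`, `pq_encode`, and `pq_decode` are hypothetical names, not the library's API.

```python
def nearest(sub, centroids):
    """Index of the centroid closest to sub in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
    return dists.index(min(dists))


def pq_encode(vec, smooth, codebook):
    """Scale each channel, split into subspaces, store one index per subspace."""
    scaled = [x / s for x, s in zip(vec, smooth)]
    width = len(codebook[0][0])
    subs = [scaled[i:i + width] for i in range(0, len(scaled), width)]
    return [nearest(sub, cb) for sub, cb in zip(subs, codebook)]


def pq_decode(codes, smooth, codebook):
    """Look up centroids, concatenate them, and undo the channel scaling."""
    flat = [x for idx, cb in zip(codes, codebook) for x in cb[idx]]
    return [x * s for x, s in zip(flat, smooth)]


# Toy artifacts: 2 subspaces of width 2, 2 centroids each (made-up values).
codebook = [
    [[1.0, 1.0], [-1.0, -1.0]],
    [[1.0, -1.0], [-1.0, 1.0]],
]
smooth = [2.0, 2.0, 0.5, 0.5]

key = [1.8, 2.2, -0.4, 0.6]
codes = pq_encode(key, smooth, codebook)
print(codes, pq_decode(codes, smooth, codebook))
```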

SpectralQuantKVCache

from veloxquant_mlx.cache.spectral_cache import SpectralQuantKVCache

SpectralQuant cache. Requires per-layer rotation matrices from calibrate_spectral_rotation().

PolarQuantKVCache

from veloxquant_mlx.cache.polar_cache import PolarQuantKVCache

PolarQuant cache. Requires no calibration; encodes keys as polar angles.
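
One way to picture a polar-angle encoding is to pair adjacent coordinates and quantize the angle of each pair to a fixed number of bits. The sketch below does exactly that. Pairing adjacent coordinates, keeping the radius unquantized, and the `angle_bits` parameter are all assumptions made for illustration, and the library's actual transform may differ.

```python
import math


def polar_encode(vec, angle_bits=4):
    """Map each (x, y) pair to (radius, quantized angle index)."""
    levels = 1 << angle_bits
    step = 2 * math.pi / levels
    encoded = []
    for x, y in zip(vec[0::2], vec[1::2]):
        radius = math.hypot(x, y)
        angle = math.atan2(y, x) % (2 * math.pi)  # map into [0, 2*pi)
        encoded.append((radius, round(angle / step) % levels))
    return encoded


def polar_decode(encoded, angle_bits=4):
    """Rebuild each pair from its radius and the quantized angle."""
    step = 2 * math.pi / (1 << angle_bits)
    out = []
    for radius, index in encoded:
        out += [radius * math.cos(index * step), radius * math.sin(index * step)]
    return out


key = [1.0, 0.0, 0.0, 2.0, -1.0, -1.0, 0.3, -0.7]
codes = polar_encode(key)
print([idx for _, idx in codes])
print([round(v, 3) for v in polar_decode(codes)])
```

No trained artifacts appear anywhere in this scheme, which is consistent with the description above: the angle grid is fixed by the bit width alone.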

QJLKVCache

from veloxquant_mlx.cache.qjl_cache import QJLKVCache

QJL 1-bit sign sketch cache.
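
A 1-bit sign sketch stores only the sign of each random projection of a key, plus the key's norm. The sketch below shows the general technique with a Gaussian projection and the usual sqrt(pi/2) scaling for the inner-product estimate. The function names are hypothetical, and the way the library draws its projection or uses `seed` and `sketch_dim` is assumed rather than confirmed.

```python
import math
import random


def make_projection(sketch_dim, head_dim, seed=0):
    """Gaussian projection matrix, reproducible for a given seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(sketch_dim)]


def project(matrix, vec):
    """Matrix-vector product in plain Python."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def sketch_key(matrix, key):
    """Store only the sign of each projection, plus the key's norm."""
    signs = [1 if p >= 0 else -1 for p in project(matrix, key)]
    return signs, math.sqrt(sum(k * k for k in key))


def estimate_dot(matrix, query, signs, key_norm):
    """Inner-product estimate from a full-precision query and a sign sketch."""
    proj_q = project(matrix, query)
    scale = math.sqrt(math.pi / 2) / len(signs)
    return scale * key_norm * sum(p * s for p, s in zip(proj_q, signs))


S = make_projection(sketch_dim=64, head_dim=8, seed=0)
key = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]
query = [0.4, -0.8, 0.3, 0.1, 1.2, -0.5, 0.0, 1.0]

signs, norm = sketch_key(S, key)
exact = sum(q * k for q, k in zip(query, key))
print("exact:", round(exact, 3), "estimate:", round(estimate_dot(S, query, signs, norm), 3))
```

The estimate is noisy for a single key and tightens as the sketch dimension grows; only the key side is reduced to signs, while the query stays in full precision.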

SlidingWindowKVCache

from veloxquant_mlx.cache.sliding_window_cache import SlidingWindowKVCache

Token eviction wrapper for any KVCache. See the Sliding Window guide.
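
The eviction idea can be sketched with a bounded deque that drops the oldest token once the window is full. `ToySlidingWindow` is a hypothetical stand-in: it stores plain Python objects rather than wrapping a real KVCache, and oldest-first eviction is an assumption about the policy.

```python
from collections import deque


class ToySlidingWindow:
    """Keeps only the most recent `window` entries of a per-token store."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest item automatically on append.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)
        self.seen = 0  # total tokens ever written, including evicted ones

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        self.seen += 1

    def positions(self):
        """Absolute token positions that are still retained."""
        return list(range(self.seen - len(self.keys), self.seen))


cache = ToySlidingWindow(window=4)
for t in range(10):
    cache.update(f"k{t}", f"v{t}")

print(list(cache.keys))
print(cache.positions())
```

Tracking the total number of tokens seen keeps absolute positions available after eviction, which matters when positional information depends on the original token index.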


See also