Core Abstractions API
veloxquant_mlx.core
Abstract base classes
veloxquant_mlx.core.abstractions
All concrete implementations subclass these ABCs. You should program to these interfaces when building custom integrations.
Quantizer
from veloxquant_mlx.core.abstractions import Quantizer

class Quantizer(ABC):
    @abstractmethod
    def encode(self, x: mx.array) -> EncodedVector: ...

    @abstractmethod
    def decode(self, encoded: EncodedVector) -> mx.array: ...

    def encode_values(self, x: mx.array) -> EncodedVector:
        return self.encode(x)

    def decode_values(self, encoded: EncodedVector) -> mx.array:
        return self.decode(encoded)
All quantizers implement encode and decode with these signatures:
encode(x): takes an input of shape [batch, heads, seq, head_dim] and returns an EncodedVector.
decode(encoded): returns an mx.array of shape [batch, heads, seq, head_dim].
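The encode/decode contract can be illustrated with a small round trip. This is a minimal sketch, not the library's implementation: it mirrors the interface with a local stand-in ABC and uses flat Python lists in place of 4-D mx.array inputs so that it runs without MLX installed. ToyEncoded, ToyQuantizer, and UniformScalarQuantizer are hypothetical names introduced only for this example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyEncoded:
    """Hypothetical stand-in for EncodedVector (flat lists instead of mx.array)."""
    indices: list[int]
    scale: float
    offset: float


class ToyQuantizer(ABC):
    """Local mirror of the Quantizer interface shown above."""

    @abstractmethod
    def encode(self, x: list[float]) -> ToyEncoded: ...

    @abstractmethod
    def decode(self, encoded: ToyEncoded) -> list[float]: ...

    def encode_values(self, x: list[float]) -> ToyEncoded:
        return self.encode(x)

    def decode_values(self, encoded: ToyEncoded) -> list[float]:
        return self.decode(encoded)


class UniformScalarQuantizer(ToyQuantizer):
    """Uniform min/max quantizer with 2**bits levels (illustrative only)."""

    def __init__(self, bits: int) -> None:
        self.levels = (1 << bits) - 1

    def encode(self, x: list[float]) -> ToyEncoded:
        lo, hi = min(x), max(x)
        # Fall back to a scale of 1.0 when the input is constant.
        scale = (hi - lo) / self.levels or 1.0
        indices = [round((v - lo) / scale) for v in x]
        return ToyEncoded(indices=indices, scale=scale, offset=lo)

    def decode(self, encoded: ToyEncoded) -> list[float]:
        return [i * encoded.scale + encoded.offset for i in encoded.indices]


q = UniformScalarQuantizer(bits=2)
x = [0.0, 1.0, 2.0, 3.0]
encoded = q.encode(x)
restored = q.decode_values(encoded)  # delegates to decode()
```

Because only encode and decode are abstract, a subclass gets encode_values and decode_values for free; overriding them is only needed when values should be treated differently from keys.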
KVCache
from veloxquant_mlx.core.abstractions import KVCache

class KVCache(ABC):
    @abstractmethod
    def update(self, keys: mx.array, values: mx.array) -> tuple[mx.array, mx.array]: ...

    @property
    @abstractmethod
    def state(self) -> tuple[mx.array, mx.array]: ...
update(keys, values) is called once per generation step. It writes the new keys/values to the compressed cache and returns the full (dequantized) cache for attention computation.
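The write-then-return behaviour of update can be sketched as follows. This is a self-contained illustration under stated assumptions: lists stand in for mx.array, rounding to a fixed step stands in for real compression, and state is assumed to return the dequantized tensors (the source only gives its type). ToyKVCache and RoundingKVCache are hypothetical names.

```python
from abc import ABC, abstractmethod


class ToyKVCache(ABC):
    """Local mirror of the KVCache interface (lists stand in for mx.array)."""

    @abstractmethod
    def update(self, keys: list[float], values: list[float]) -> tuple[list[float], list[float]]: ...

    @property
    @abstractmethod
    def state(self) -> tuple[list[float], list[float]]: ...


class RoundingKVCache(ToyKVCache):
    """Stores entries as scaled integers, a stand-in for real compression."""

    def __init__(self, scale: float = 0.5) -> None:
        self.scale = scale
        self._keys: list[int] = []
        self._values: list[int] = []

    def _compress(self, xs: list[float]) -> list[int]:
        return [round(v / self.scale) for v in xs]

    def _decompress(self, codes: list[int]) -> list[float]:
        return [c * self.scale for c in codes]

    def update(self, keys, values):
        # Append this step's entries in compressed form...
        self._keys.extend(self._compress(keys))
        self._values.extend(self._compress(values))
        # ...and hand back the full dequantized cache for attention.
        return self.state

    @property
    def state(self):
        return self._decompress(self._keys), self._decompress(self._values)


cache = RoundingKVCache(scale=0.5)
cache.update([1.0], [2.0])                   # generation step 0
keys, values = cache.update([1.5], [-0.5])   # generation step 1
```

After the second call, keys and values each hold two entries: the cache grows by one position per step while the caller always sees the full history.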
Preconditioner
from veloxquant_mlx.core.abstractions import Preconditioner
Linear transforms applied before quantization.
class Preconditioner(ABC):
    @abstractmethod
    def apply(self, x: mx.array) -> mx.array: ...

    @abstractmethod
    def inverse(self, x: mx.array) -> mx.array: ...
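One of the simplest invertible linear transforms is a fixed pattern of sign flips, which is a diagonal matrix with entries of +1 or -1. The sketch below uses it to show the apply/inverse pairing; it is not one of the library's preconditioners, lists stand in for mx.array, and ToyPreconditioner and SignFlip are hypothetical names.

```python
import random
from abc import ABC, abstractmethod


class ToyPreconditioner(ABC):
    """Local mirror of the Preconditioner interface."""

    @abstractmethod
    def apply(self, x: list[float]) -> list[float]: ...

    @abstractmethod
    def inverse(self, x: list[float]) -> list[float]: ...


class SignFlip(ToyPreconditioner):
    """Multiply each channel by a fixed +1/-1 (a diagonal orthogonal matrix)."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # seeded so encode and decode agree on the signs
        self.signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

    def apply(self, x):
        return [s * v for s, v in zip(self.signs, x)]

    def inverse(self, x):
        # A +/-1 diagonal matrix is its own inverse.
        return self.apply(x)


p = SignFlip(dim=4, seed=0)
x = [0.5, -1.25, 3.0, 2.0]
y = p.apply(x)
roundtrip = p.inverse(y)
```

The property worth checking for any custom preconditioner is that inverse(apply(x)) reproduces x, exactly here and up to floating-point error for general transforms.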
Codebook
from veloxquant_mlx.core.abstractions import Codebook

class Codebook(ABC):
    @abstractmethod
    def quantize(self, x: mx.array) -> mx.array: ...  # returns indices

    @abstractmethod
    def dequantize(self, indices: mx.array) -> mx.array: ...
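A nearest-centroid lookup is one common way to satisfy this interface. The sketch below assumes a scalar codebook and uses lists in place of mx.array; ToyCodebook and NearestCentroidCodebook are hypothetical names, and the centroid values are arbitrary.

```python
from abc import ABC, abstractmethod


class ToyCodebook(ABC):
    """Local mirror of the Codebook interface."""

    @abstractmethod
    def quantize(self, x: list[float]) -> list[int]: ...  # returns indices

    @abstractmethod
    def dequantize(self, indices: list[int]) -> list[float]: ...


class NearestCentroidCodebook(ToyCodebook):
    def __init__(self, centroids: list[float]) -> None:
        self.centroids = centroids

    def quantize(self, x):
        # Index of the closest centroid for each value (ties go to the lower index).
        return [
            min(range(len(self.centroids)), key=lambda i: abs(v - self.centroids[i]))
            for v in x
        ]

    def dequantize(self, indices):
        return [self.centroids[i] for i in indices]


cb = NearestCentroidCodebook([-1.0, -0.25, 0.25, 1.0])  # 4 entries = 2 bits per value
indices = cb.quantize([0.9, -0.3, 0.1, -2.0])
approx = cb.dequantize(indices)
```

Note that out-of-range inputs such as -2.0 are clamped to the nearest end of the codebook, so the reconstruction error depends on how well the centroids cover the data.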
Context types
veloxquant_mlx.core.context
EncodedVector
from veloxquant_mlx.core.context import EncodedVector

@dataclass
class EncodedVector:
    indices: mx.array               # packed integer codes
    scale: mx.array | None          # per-channel or per-block scale
    metadata: dict                  # algorithm-specific extra data
    original_shape: tuple[int, ...]
    dtype: mx.Dtype                 # original dtype (usually mx.float16)
EncodedVector is the currency passed between encode() and decode(). Different algorithms store different things in metadata (e.g., cluster IDs for RaBitQ, rotation info for SpectralQuant).
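To make the role of each field concrete, the sketch below mirrors the dataclass locally and decodes it with a toy function. It is an illustration only: lists and a string replace mx.array and mx.Dtype, the "zero_point" metadata key is a hypothetical example of algorithm-specific data, and ToyEncodedVector and toy_decode are not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyEncodedVector:
    """Local mirror of EncodedVector with the same five fields."""
    indices: list[int]
    scale: Optional[float]
    metadata: dict
    original_shape: tuple[int, ...]
    dtype: str


def toy_decode(enc: ToyEncodedVector) -> list[list[float]]:
    """Rebuild a 2-D array from flat codes using scale, metadata, and original_shape."""
    scale = 1.0 if enc.scale is None else enc.scale
    zero_point = enc.metadata.get("zero_point", 0)  # hypothetical metadata key
    flat = [(i - zero_point) * scale for i in enc.indices]
    rows, cols = enc.original_shape
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


enc = ToyEncodedVector(
    indices=[0, 3, 1, 2],
    scale=0.5,
    metadata={"zero_point": 1},
    original_shape=(2, 2),
    dtype="float16",
)
decoded = toy_decode(enc)
```

The decoder never sees the original tensor, so everything it needs (shape, scale, and any algorithm-specific parameters) has to travel inside the encoded object.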
QuantizationContext
from veloxquant_mlx.core.context import QuantizationContext
Request-scoped context passed through a quantization pipeline.
@dataclass
class QuantizationContext:
    layer_name: str
    step: int                # generation step (0-indexed)
    config: KVCacheConfig
    artifacts: ArtifactStore
TransformResult
from veloxquant_mlx.core.context import TransformResult
Output of a Preconditioner.apply() call, including pre-transform metadata needed for the inverse.
Registry
from veloxquant_mlx.core.registry import QuantizerRegistry
Plugin registry for quantizer discovery.
# Register a custom quantizer
@QuantizerRegistry.register("my_quantizer")
class MyQuantizer(Quantizer):
    def encode(self, x): ...
    def decode(self, encoded): ...

# Create by name
q = QuantizerFactory.create("my_quantizer", bits=2)
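For readers unfamiliar with decorator-based registries, the pattern can be sketched in a few lines. This is a generic illustration, not the library's implementation: ToyRegistry is a hypothetical class that folds registration and creation into one place, whereas the snippet above uses QuantizerRegistry for registration and QuantizerFactory for creation.

```python
from typing import Callable


class ToyRegistry:
    """Hypothetical name-to-class registry illustrating the pattern."""
    _classes: dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(klass: type) -> type:
            if name in cls._classes:
                raise ValueError(f"quantizer {name!r} is already registered")
            cls._classes[name] = klass
            return klass  # return the class unchanged so it stays usable directly
        return decorator

    @classmethod
    def create(cls, name: str, **kwargs):
        try:
            klass = cls._classes[name]
        except KeyError:
            raise KeyError(f"unknown quantizer {name!r}") from None
        # Keyword arguments are forwarded to the registered class's constructor.
        return klass(**kwargs)


@ToyRegistry.register("my_quantizer")
class MyQuantizer:
    def __init__(self, bits: int) -> None:
        self.bits = bits


q = ToyRegistry.create("my_quantizer", bits=2)
```

Registration happens when the decorated class is defined, so in this pattern the module containing a custom quantizer has to be imported before the quantizer can be created by name.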
CLI reference
The veloxquant_mlx package exposes two CLI commands via python -m veloxquant_mlx:
precompute
python -m veloxquant_mlx precompute \
    --method {vecinfer,spectral,ratequant} \
    --model MODEL_PATH \
    --output ARTIFACT_DIR \
    [--num-samples N] \
    [--sequence-length L] \
    [--target-bits BITS]
Runs calibration for the specified method and saves artifacts.
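When calibration is driven from a script, the command line can be assembled programmatically. The helper below is a sketch that uses only the flags listed in the synopsis above; precompute_argv is a hypothetical function, and the model path, output directory, and sample count are placeholder values.

```python
import shlex
import sys

METHODS = ("vecinfer", "spectral", "ratequant")  # choices listed for --method


def precompute_argv(method, model, output, num_samples=None,
                    sequence_length=None, target_bits=None):
    """Build the argument list for `python -m veloxquant_mlx precompute`."""
    if method not in METHODS:
        raise ValueError(f"--method must be one of {METHODS}, got {method!r}")
    argv = [sys.executable, "-m", "veloxquant_mlx", "precompute",
            "--method", method, "--model", str(model), "--output", str(output)]
    # Optional flags are appended only when a value is supplied.
    for flag, value in (("--num-samples", num_samples),
                        ("--sequence-length", sequence_length),
                        ("--target-bits", target_bits)):
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Placeholder paths and sample count, for illustration only.
argv = precompute_argv("spectral", "path/to/model", "artifacts/spectral", num_samples=128)
print(shlex.join(argv))
# To execute the command: subprocess.run(argv, check=True)
```

Passing the list to subprocess.run avoids shell quoting issues with paths that contain spaces.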
benchmark
python -m veloxquant_mlx benchmark \
    --model MODEL_PATH \
    --method METHOD \
    --bits BITS \
    [--value-bits BITS] \
    [--seq-len SEQ] \
    [--num-runs N] \
    [--output JSON_PATH]
Benchmarks a configuration, prints the metrics, and optionally saves them to JSON_PATH. See the Benchmarking guide.