5-Minute Quickstart
This guide gets you from a fresh install to compressed LLM inference in five minutes. You will load a model with mlx_lm, attach a TurboQuant RVQ KV cache, generate text, and print memory statistics.
:::note Prerequisites
Complete Installation first. You need mlx_lm installed (pip install mlx-lm) and a model downloaded locally (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit).
:::
Step 1 — Load a model
import mlx_lm
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")
Step 2 — Create a compressed KV cache
from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder
config = KVCacheConfig(
method="turboquant_rvq", # zero-calibration 1-bit RVQ
bits=1, # 1-bit keys, 2-bit values (default)
)
# Build per-layer cache matching the model architecture
cache = KVCacheBuilder.build(model, config)
Step 3 — Generate with compression
prompt = "Explain the key-value cache in large language models in simple terms."
response = mlx_lm.generate(
model,
tokenizer,
prompt=prompt,
max_tokens=512,
kv_cache=cache, # drop-in replacement for the default cache
verbose=True,
)
print(response)
Step 4 — Inspect memory savings
from veloxquant_mlx.observers.memory import MemoryObserver
observer = MemoryObserver()
observer.attach(cache)
# Run a longer generation to see the savings
response = mlx_lm.generate(
model, tokenizer, prompt=prompt, max_tokens=2048, kv_cache=cache
)
report = observer.report()
print(f"Peak compressed memory : {report.peak_compressed_mb:.1f} MB")
print(f"Equivalent fp16 memory : {report.peak_fp16_mb:.1f} MB")
print(f"Compression ratio : {report.compression_ratio:.1f}×")
Example output on M3 Pro (Llama-3.2-3B, 2048 tokens):
Peak compressed memory : 48.3 MB
Equivalent fp16 memory : 362.0 MB
Compression ratio : 7.5×
Full script
import mlx_lm
from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder
from veloxquant_mlx.observers.memory import MemoryObserver
# Load model
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")
# Configure compressed cache
config = KVCacheConfig(method="turboquant_rvq", bits=1)
cache = KVCacheBuilder.build(model, config)
# Attach memory observer
observer = MemoryObserver()
observer.attach(cache)
# Generate
prompt = "Write a short story about a robot learning to paint."
response = mlx_lm.generate(
model, tokenizer, prompt=prompt, max_tokens=1024, kv_cache=cache
)
print(response)
# Print stats
report = observer.report()
print(f"\nMemory: {report.peak_compressed_mb:.1f} MB "
f"(vs {report.peak_fp16_mb:.1f} MB fp16, "
f"{report.compression_ratio:.1f}× compression)")
What just happened?
KVCacheConfigdescribes which algorithm and bit-width to useKVCacheBuilder.build()creates one cache per transformer layer, matching the model'snum_key_value_headsandhead_dim- During generation, each attention layer writes compressed keys/values via Metal GPU kernels instead of storing raw fp16 tensors
- The
MemoryObservertracks peak allocation and reports the savings
Try a stronger algorithm
For higher accuracy at a slightly higher compute cost, switch to VecInfer (requires a one-time codebook training step):
from veloxquant_mlx.allocators.vecinfer import train_codebook, calibrate_smooth_factors
# One-time calibration (save and reuse across sessions)
smooth_factors = calibrate_smooth_factors(model, tokenizer, num_samples=64)
codebook = train_codebook(model, tokenizer, smooth_factors, num_samples=128)
config = KVCacheConfig(
method="vecinfer",
bits=2,
codebook=codebook,
smooth_factors=smooth_factors,
)
cache = KVCacheBuilder.build(model, config)
See VecInfer algorithm docs and the mlx_lm integration guide for full details.