5-Minute Quickstart

This guide gets you from a fresh install to compressed LLM inference in five minutes. You will load a model with mlx_lm, attach a TurboQuant RVQ KV cache, generate text, and print memory statistics.

Prerequisites

Complete Installation first. You need mlx_lm installed (pip install mlx-lm) and a model downloaded locally (e.g. mlx-community/Llama-3.2-3B-Instruct-4bit).

Step 1 — Load a model

import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")

Step 2 — Create a compressed KV cache

from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder

config = KVCacheConfig(
    method="turboquant_rvq",   # zero-calibration 1-bit RVQ
    bits=1,                    # 1-bit keys, 2-bit values (default)
)

# Build per-layer cache matching the model architecture
cache = KVCacheBuilder.build(model, config)

Step 3 — Generate with compression

prompt = "Explain the key-value cache in large language models in simple terms."

response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    kv_cache=cache,            # drop-in replacement for the default cache
    verbose=True,
)

print(response)

Step 4 — Inspect memory savings

from veloxquant_mlx.observers.memory import MemoryObserver

observer = MemoryObserver()
observer.attach(cache)

# Run a longer generation to see the savings
response = mlx_lm.generate(
    model, tokenizer, prompt=prompt, max_tokens=2048, kv_cache=cache
)

report = observer.report()
print(f"Peak compressed memory : {report.peak_compressed_mb:.1f} MB")
print(f"Equivalent fp16 memory : {report.peak_fp16_mb:.1f} MB")
print(f"Compression ratio      : {report.compression_ratio:.1f}×")

Example output on M3 Pro (Llama-3.2-3B, 2048 tokens):

Peak compressed memory : 48.3 MB
Equivalent fp16 memory : 362.0 MB
Compression ratio      : 7.5×

Full script

import mlx_lm
from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder
from veloxquant_mlx.observers.memory import MemoryObserver

# Load model
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Configure compressed cache
config = KVCacheConfig(method="turboquant_rvq", bits=1)
cache = KVCacheBuilder.build(model, config)

# Attach memory observer
observer = MemoryObserver()
observer.attach(cache)

# Generate
prompt = "Write a short story about a robot learning to paint."
response = mlx_lm.generate(
    model, tokenizer, prompt=prompt, max_tokens=1024, kv_cache=cache
)
print(response)

# Print stats
report = observer.report()
print(f"\nMemory: {report.peak_compressed_mb:.1f} MB "
      f"(vs {report.peak_fp16_mb:.1f} MB fp16, "
      f"{report.compression_ratio:.1f}× compression)")

What just happened?

KVCacheConfig describes which algorithm and bit-width to use
KVCacheBuilder.build() creates one cache per transformer layer, matching the model's num_key_value_heads and head_dim
During generation, each attention layer writes compressed keys/values via Metal GPU kernels instead of storing raw fp16 tensors
The MemoryObserver tracks peak allocation and reports the savings

Try a stronger algorithm

For higher accuracy at a slightly higher compute cost, switch to VecInfer (requires a one-time codebook training step):

from veloxquant_mlx.allocators.vecinfer import train_codebook, calibrate_smooth_factors

# One-time calibration (save and reuse across sessions)
smooth_factors = calibrate_smooth_factors(model, tokenizer, num_samples=64)
codebook = train_codebook(model, tokenizer, smooth_factors, num_samples=128)

config = KVCacheConfig(
    method="vecinfer",
    bits=2,
    codebook=codebook,
    smooth_factors=smooth_factors,
)
cache = KVCacheBuilder.build(model, config)

See VecInfer algorithm docs and the mlx_lm integration guide for full details.

Step 1 — Load a model​

Step 2 — Create a compressed KV cache​

Step 3 — Generate with compression​

Step 4 — Inspect memory savings​

Full script​

What just happened?​

Try a stronger algorithm​

Next steps​