Skip to main content

Sliding Window Cache

SlidingWindowKVCache wraps any VeloxQuant-MLX cache with a token eviction policy. When the sequence length exceeds the window size, old tokens are evicted from the cache, keeping memory bounded regardless of generation length.

Why use a sliding window?

Standard KV caches grow linearly with sequence length. Even with compression, a 32k-token conversation can exhaust memory on an M2 MacBook. The sliding window bounds the cache size at the cost of losing access to tokens outside the window.

This is useful for:

  • Very long conversations or documents
  • Streaming generation where you only need recent context
  • Memory-constrained devices (M1 8 GB, M2 8 GB)

Basic usage

import mlx_lm
from veloxquant_mlx.cache.base import KVCacheConfig, KVCacheBuilder
from veloxquant_mlx.cache.sliding_window_cache import SlidingWindowKVCache

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Create the inner compressed cache
config = KVCacheConfig(method="turboquant_rvq", bits=1)
inner_cache = KVCacheBuilder.build(model, config)

# Wrap it with a sliding window of 2048 tokens
cache = SlidingWindowKVCache(
inner_cache=inner_cache,
window_size=2048,
eviction_strategy="fifo", # oldest tokens evicted first
)

# Use exactly like a normal cache
response = mlx_lm.generate(
model, tokenizer,
prompt="Tell me a very long story...",
max_tokens=8192, # can exceed window_size — old tokens evicted automatically
kv_cache=cache,
)

Eviction strategies

StrategyDescriptionBest for
"fifo"Evict oldest tokens firstGeneral use — maintains recency
"attention"Evict lowest-attention-weight tokensPreserves semantically important tokens
"fixed"Keep first N + last M tokensLong documents with important prefix
# Attention-based eviction (requires attention score tracking)
cache = SlidingWindowKVCache(
inner_cache=inner_cache,
window_size=2048,
eviction_strategy="attention",
eviction_attention_window=64, # score tokens over last 64 steps
)

# Fixed: keep first 128 (system prompt) + last 1920 (recent context)
cache = SlidingWindowKVCache(
inner_cache=inner_cache,
window_size=2048,
eviction_strategy="fixed",
fixed_prefix_size=128,
)

Combining with any algorithm

SlidingWindowKVCache wraps any inner cache — including calibrated algorithms:

from veloxquant_mlx.cache.vecinfer_cache import VecInferKVCache
from veloxquant_mlx.artifacts.npy_store import NpyArtifactStore

store = NpyArtifactStore("./artifacts/")
inner_cache = VecInferKVCache(
model=model,
codebook=store.load("vecinfer_codebook"),
smooth_factors=store.load("vecinfer_smooth"),
)

cache = SlidingWindowKVCache(
inner_cache=inner_cache,
window_size=4096,
eviction_strategy="attention",
)

Monitoring evictions

cache = SlidingWindowKVCache(inner_cache=inner_cache, window_size=2048)

response = mlx_lm.generate(model, tokenizer, prompt=prompt, max_tokens=4096, kv_cache=cache)

stats = cache.eviction_stats()
print(f"Total evictions: {stats.total_evictions}")
print(f"Current window usage: {stats.current_size} / {cache.window_size}")
print(f"Tokens evicted: {stats.tokens_evicted}")

Configuration reference

ParameterTypeDefaultDescription
inner_cacheKVCacheRequiredAny VeloxQuant-MLX cache
window_sizeintRequiredMaximum tokens to keep in cache
eviction_strategystr"fifo""fifo", "attention", or "fixed"
fixed_prefix_sizeint0Tokens to always keep at start (for "fixed" strategy)
eviction_attention_windowint32Steps to score attention over (for "attention" strategy)

See also