Fixing ValueError: max seq len is larger than KV cache capacity in vLLM

intermediate | AI Tools | 2026-05-22 | Linux (Ubuntu 20.04+), Python 3.9+, vLLM 0.3.x–0.5.x, NVIDIA GPU (CUDA 11.8+)

Error Message

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache. Try increasing gpu_memory_utilization or decreasing max_model_len.

#vllm #llm #gpu #kv-cache

What happened

You started vLLM to serve a model — Llama 3, Mistral, Qwen, take your pick — and it died before handling a single request:

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache. Try increasing gpu_memory_utilization or decreasing max_model_len.

This crashes at startup, not mid-inference. vLLM measures how many tokens fit in the KV cache from your GPU's free memory, then compares that against max_model_len. Can't fit even one full sequence? It won't start.

Why this happens

vLLM pre-allocates GPU memory for the KV cache during initialization. Four things control how much room is left:

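Total GPU VRAM: the hard ceiling
gpu_memory_utilization: fraction of total VRAM that vLLM may use for weights, activations, and KV cache combined (default: 0.90)
Model weights: already loaded before the cache gets sized
max_model_len: maximum sequence length vLLM will accept (defaults to the context length in the model's config)

Weights load first. vLLM then runs a profiling pass to measure peak activation memory. Its budget is total VRAM × gpu_memory_utilization, and whatever is left after subtracting weights and activations goes to the KV cache. If that remainder can't hold even one sequence of max_model_len tokens, you get this error.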

The classic scenario: a 32K or 128K context model on a 24 GB GPU where the weights alone consume 18+ GB. Another common culprit is a stale Jupyter kernel or leftover CUDA context holding VRAM before vLLM even starts.
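To make the arithmetic concrete, here is a rough back-of-the-envelope sketch of that 24 GB scenario. It is not vLLM's actual accounting. The layer and head counts are assumed values for a Llama-style model with grouped-query attention, the 1 GiB activation figure is a placeholder, and GB is treated as GiB for simplicity.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_kv_capacity(total_vram_gib, gpu_memory_utilization,
                         weights_gib, activation_gib, bytes_per_token):
    """Rough number of tokens the KV cache can hold; 0 if nothing is left."""
    gib = 1024 ** 3
    budget_gib = total_vram_gib * gpu_memory_utilization   # what vLLM may use
    kv_gib = budget_gib - weights_gib - activation_gib     # what remains for the cache
    return max(0, int(kv_gib * gib // bytes_per_token))


# Assumed model shape (hypothetical): 32 layers, 8 KV heads, head dim 128, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

capacity = estimate_kv_capacity(
    total_vram_gib=24,
    gpu_memory_utilization=0.90,
    weights_gib=18.0,     # from the scenario above
    activation_gib=1.0,   # assumed placeholder for peak activation memory
    bytes_per_token=per_token,
)

max_model_len = 32768
print(f"{per_token} bytes per token, room for about {capacity} tokens")
print("fits" if capacity >= max_model_len else "would fail at startup")
```

Under these assumptions the cache has room for far fewer than 32768 tokens, which is exactly the condition the error reports. Swap in your own model's shape from its config.json to get a first guess before launching.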

Quick fix — reduce max_model_len

Most workloads never touch 32K tokens anyway. Cap the context and move on:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192

Or with the Python API:

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=8192
)

Start at 4096 or 8192 — enough for most chat applications. Once you know what actually fits, you can push it higher.

Alternative fix — increase gpu_memory_utilization

Need longer context without shrinking max_model_len? Squeeze more out of your GPU:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.95

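Don't push past 0.95. vLLM also needs some memory outside this budget (the CUDA context, CUDA graph capture, allocator fragmentation), and anything else on the GPU needs room too. Go higher and you risk OOM errors mid-request, which is worse than a clean startup failure. Treat 0.95 as the practical ceiling.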

Permanent fix — free up VRAM before starting

Run nvidia-smi and see what's squatting on your GPU before launching vLLM:

nvidia-smi

Spot a process holding VRAM (a Jupyter kernel, another model server, a zombie CUDA context)? Kill it first:

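# Find the PID from nvidia-smi output and try a normal termination first
kill <PID>
# If the process ignores SIGTERM, force it
kill -9 <PID>

# Or kill every process that has an NVIDIA device file open (use carefully:
# this is not limited to Python and can take down a display server)
fuser -k /dev/nvidia*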

Restart vLLM on a clean GPU. The default gpu_memory_utilization=0.90 is usually enough once nothing else is competing for memory.
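If you'd rather script the check than eyeball the nvidia-smi table, a minimal sketch along these lines may help. It relies on nvidia-smi's CSV query mode. The helper names are made up for this example, and the sample row is illustrative, not real output.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into (pid, name, MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = line.split(",")
        if len(parts) < 3 or not parts[0].strip().isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0].strip())
        name = ",".join(parts[1:-1]).strip()  # tolerate commas in the process name
        used = parts[-1].strip()
        # Some environments report "[N/A]" instead of a number
        mib = int(used) if used.isdigit() else None
        rows.append((pid, name, mib))
    return rows


def gpu_processes():
    """Ask nvidia-smi which compute processes currently hold GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)


# Illustrative sample row (not real output), parsed without touching the GPU
sample = "4242, python, 18034\n"
for pid, name, mib in parse_compute_apps(sample):
    print(f"PID {pid} ({name}) holds {mib} MiB")
```

On the serving host, call gpu_processes() before launching vLLM and refuse to start if anything unexpected shows up in the list.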

Multi-GPU option

Spread the load across GPUs with tensor parallelism:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768

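Tensor parallelism shards both the weights and the KV cache across GPUs, so two GPUs roughly double the VRAM available. That can make a long context feasible where a single GPU would choke, but only if the sharded weights leave room: a 70B model in 16-bit precision needs about 140 GB for weights alone, so size the GPU count accordingly. Also note that --max-model-len cannot exceed the context length in the model's own config (8,192 tokens for Meta-Llama-3-70B-Instruct), so 32768 here assumes a checkpoint that supports it.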

How to find a safe max_model_len for your setup

No single formula covers every model and vLLM version, so measuring is more reliable than guessing. Start at 4096 and double until it breaks:

# Test with increasing context lengths (each run reloads the model)
for LEN in 4096 8192 16384 32768; do
  echo "Testing max_model_len=$LEN"
  python -c "
from vllm import LLM
try:
    llm = LLM('meta-llama/Meta-Llama-3-8B-Instruct', max_model_len=$LEN)
    print('OK: $LEN works')
except ValueError as e:
    print('FAIL:', e)
" 2>&1 | grep -E '^(OK|FAIL)'
done

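The highest value that prints OK is your ceiling for that GPU/model combination. Read the FAIL message too: it can also mean the length exceeds the model's own context limit (8,192 tokens for Meta-Llama-3-8B-Instruct) rather than the KV cache. Write the ceiling down, because it changes if you load a different model or add another process to the machine.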
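The doubling loop only brackets the answer between the last OK and the first FAIL. If you want a tighter number, you can bisect that gap. The sketch below assumes a hypothetical fits(length) callable that returns True when vLLM starts at that length (for example by launching it in a subprocess). Here it is replaced by a stand-in so the logic can run anywhere.

```python
def largest_fitting_length(fits, start=4096, limit=131072, granularity=1024):
    """Largest length, to within `granularity`, for which fits() is True.

    Returns 0 if even `start` does not fit.
    """
    if not fits(start):
        return 0
    lo = start   # largest length known to fit
    hi = None    # smallest length known to fail
    # Phase 1: keep doubling until a probe fails or the limit is reached.
    while lo < limit:
        probe = min(lo * 2, limit)
        if fits(probe):
            lo = probe
        else:
            hi = probe
            break
    if hi is None:
        return lo  # everything up to the limit fit
    # Phase 2: bisect between the last success and the first failure.
    while hi - lo > granularity:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Stand-in for a real probe: pretend the KV cache holds 21299 tokens.
def fake_fits(length):
    return length <= 21299


best = largest_fitting_length(fake_fits)
print(f"largest length that fits (within 1024 tokens): {best}")
```

Each real probe means a full model load, so a coarse granularity such as 1024 keeps the number of launches small.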

Verify the fix

A clean startup logs the KV cache allocation — look for the GPU blocks line:

INFO:     # GPU blocks: 1234, # CPU blocks: 512
INFO:     Avg prompt throughput: 0.0 tokens/s, ...
INFO:     Application startup complete.

Each block holds 16 tokens by default. So 1234 blocks × 16 = 19,744 tokens of real capacity. That number must be ≥ your max_model_len, or you'd never have gotten past startup.
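The same check can be scripted against a captured startup log. This is a small sketch that assumes the log line has the shape shown above and the default block size of 16. The function name is made up for this example.

```python
import re

BLOCK_SIZE = 16  # tokens per block, the default mentioned above


def kv_capacity_from_log(line, block_size=BLOCK_SIZE):
    """Extract the GPU block count from a startup log line and convert it to tokens."""
    match = re.search(r"# GPU blocks: (\d+)", line)
    if match is None:
        raise ValueError("no '# GPU blocks' entry in this line")
    return int(match.group(1)) * block_size


line = "INFO:     # GPU blocks: 1234, # CPU blocks: 512"
capacity = kv_capacity_from_log(line)
max_model_len = 8192
print(f"KV cache capacity: {capacity} tokens")
print("enough for max_model_len" if capacity >= max_model_len else "too small")
```

If you start vLLM with a non-default block size, pass the same value as block_size so the token count stays accurate.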

Confirm inference actually works end-to-end:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Hello, world!",
    "max_tokens": 50
  }'

Summary

Fastest fix: add --max-model-len 8192 (or lower) to your startup command
Need more context? Try --gpu-memory-utilization 0.95 first, then clear VRAM
Large models on tight VRAM: use --tensor-parallel-size 2 across multiple GPUs
Always run nvidia-smi before starting vLLM — stale processes waste your time