What happened
You started vLLM to serve a model (Llama 3, Mistral, Qwen, take your pick) and it died before handling a single request:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache. Try increasing gpu_memory_utilization or decreasing max_model_len.
This crashes at startup, not mid-inference. vLLM measures how many tokens fit in the KV cache from your GPU's free memory, then compares that against max_model_len. Can't fit even one full sequence? It won't start.
Why this happens
vLLM pre-allocates GPU memory for the KV cache during initialization. Four things control how much room is left:
- Total GPU VRAM: the hard ceiling
- gpu_memory_utilization: fraction of VRAM vLLM may use (default: 0.90)
- Model weights: already loaded before the cache gets sized
- max_model_len: the longest sequence (prompt plus generated tokens) vLLM will accept; defaults to the context length in the model's config
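Weights load first. vLLM then runs a profiling pass to measure peak activation memory. Its total budget is VRAM × gpu_memory_utilization, and whatever is left of that budget after weights and activations goes to the KV cache. If that remainder can't hold even one sequence of max_model_len tokens, you get this error.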
The classic scenario: a 32K or 128K context model on a 24 GB GPU where the weights alone consume 18+ GB. Another common culprit: a stale Jupyter kernel or leftover CUDA context eating VRAM before vLLM even starts.
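To make that accounting concrete, here is a rough back-of-the-envelope sketch in Python. It is not vLLM's own code: the function names are made up for illustration, the layer and head counts are assumed values resembling an 8B Llama-style model in fp16, and the 18 GiB weights and 1.5 GiB activation figures are placeholders for the 24 GB scenario.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_tokens_in_cache(total_vram_gib, gpu_memory_utilization,
                        weights_gib, activation_gib, bytes_per_token):
    """Estimate how many tokens fit in the memory left over for the KV cache."""
    budget_gib = total_vram_gib * gpu_memory_utilization - weights_gib - activation_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * GIB) // bytes_per_token

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Placeholder scenario: 24 GiB card, default utilization, 18 GiB of weights.
capacity = max_tokens_in_cache(24, 0.90, 18, 1.5, per_token)

print(f"{per_token} bytes per token, room for about {capacity} tokens")
print("fits 32768?", capacity >= 32768)
```

With these placeholder numbers the estimate lands well under 32,768 tokens, which is exactly the situation that triggers the startup error. Real figures depend on the model's config, dtype, and how much memory the profiling pass measures.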
Quick fix: reduce max_model_len
Most workloads never touch 32K tokens anyway. Cap the context and move on:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--max-model-len 8192
Or with the Python API:
from vllm import LLM
llm = LLM(
model="meta-llama/Meta-Llama-3-8B-Instruct",
max_model_len=8192
)
Start at 4096 or 8192, which is enough for most chat applications. Once you know what actually fits, you can push it higher.
Alternative fix: increase gpu_memory_utilization
Need longer context without shrinking max_model_len? Squeeze more out of your GPU:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--gpu-memory-utilization 0.95
Don't push past 0.95. The slice you leave unclaimed has to cover the CUDA context, allocator fragmentation, and anything else that touches the GPU. Shrink it further and you risk OOM errors mid-request, which is worse than a clean startup failure. Treat 0.95 as the practical ceiling.
Permanent fix: free up VRAM before starting
Run nvidia-smi and see what's squatting on your GPU before launching vLLM:
nvidia-smi
Spot a process holding VRAM (a Jupyter kernel, another model server, a zombie CUDA context)? Kill it first:
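# Find the PID in the nvidia-smi process table
kill <PID>       # ask politely first (SIGTERM)
kill -9 <PID>    # force-kill only if it refuses to exit
# Or kill every process holding an NVIDIA device open, not just Python (use carefully; may need sudo)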
fuser -k /dev/nvidia*
Restart vLLM on a clean GPU. The default gpu_memory_utilization=0.90 is usually enough once nothing else is competing for memory.
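If you launch vLLM from a script, the same check can be automated. The sketch below is one possible approach, not part of vLLM: it shells out to nvidia-smi with its CSV query flags and flags any GPU that already has memory in use. The function names and the 5% threshold are arbitrary choices for illustration, and the parser assumes every field is a plain integer (some setups report `[N/A]`, which this sketch does not handle).

```python
import subprocess

# nvidia-smi's CSV query mode: one row per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Turn 'index, used_mib, total_mib' rows into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "used_fraction": used / total,
        })
    return gpus

def busy_gpus(gpus, max_used_fraction=0.05):
    """Return the GPUs whose memory use exceeds the (arbitrary) threshold."""
    return [gpu for gpu in gpus if gpu["used_fraction"] > max_used_fraction]

def gpu_memory_report():
    """Run nvidia-smi and parse its output; requires the NVIDIA driver tools."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

On a machine with the driver installed, `busy_gpus(gpu_memory_report())` returns an empty list when the GPUs are clean. Anything else is worth investigating before you start the server.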
Multi-GPU option
Spread the load across GPUs with tensor parallelism:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
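--max-model-len 8192
Tensor parallelism shards the weights across the GPUs, so each card holds about half of them and has far more VRAM left over for the KV cache. That lets a 70B model serve its full context where a single GPU would choke. Llama 3 checkpoints ship with an 8,192-token window and vLLM rejects a larger max_model_len for them by default, so reach for 32K only with a model whose config supports it.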
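A rough Python estimate shows why the extra GPU matters. This is an illustrative sketch, not vLLM's accounting: it assumes weights and KV heads split evenly across the tensor-parallel group, uses assumed shape values resembling a 70B Llama-style model in fp16 (80 layers, 8 KV heads, head dimension 128), and treats the 132 GiB weights, 2 GiB activation, and 80 GiB card sizes as placeholders.

```python
GIB = 1024 ** 3

def tokens_with_tensor_parallel(vram_gib, utilization, weights_gib,
                                activation_gib, kv_bytes_per_token, tp_size):
    """Estimate KV cache capacity in tokens for a tensor-parallel group.

    Assumes each GPU stores 1/tp_size of the weights and 1/tp_size of
    every token's KV entries, so capacity is set by the per-GPU budget.
    """
    per_gpu_budget_gib = vram_gib * utilization - weights_gib / tp_size - activation_gib
    if per_gpu_budget_gib <= 0:
        return 0
    per_gpu_bytes_per_token = kv_bytes_per_token / tp_size
    return int(per_gpu_budget_gib * GIB / per_gpu_bytes_per_token)

# Key + value, 80 layers, 8 KV heads, head dimension 128, 2 bytes per element.
KV_BYTES = 2 * 80 * 8 * 128 * 2

one_gpu = tokens_with_tensor_parallel(80, 0.90, 132, 2, KV_BYTES, tp_size=1)
two_gpus = tokens_with_tensor_parallel(80, 0.90, 132, 2, KV_BYTES, tp_size=2)

print("1 GPU :", one_gpu, "tokens")
print("2 GPUs:", two_gpus, "tokens")
```

Under these assumptions a single card has no room left at all, while two cards leave space for a usable cache. Quantized weights or larger cards change the picture, so plug in your own numbers.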
How to find a safe max_model_len for your setup
No formula works universally, so search empirically instead of guessing. Start at 4096 and double until it breaks:
# Test with increasing context lengths; each iteration loads the model in a fresh process
for LEN in 4096 8192 16384 32768; do
  echo "Testing max_model_len=$LEN"
  python -c "
from vllm import LLM
try:
    llm = LLM('meta-llama/Meta-Llama-3-8B-Instruct', max_model_len=$LEN)
    print('OK: $LEN works')
except Exception as e:  # newer vLLM versions may wrap the ValueError
    print('FAIL:', e)
" 2>&1 | grep -E '^(OK:|FAIL:)'
done
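The highest value that prints OK is your ceiling for that GPU/model combination. Write it down; it changes if you load a different model or add another process to the machine. Read the FAIL message too: a value above the model's own context length is rejected for that reason, not because the KV cache is too small.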
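Doubling only brackets the answer between the last OK and the first FAIL. To narrow it down, a true binary search over that range needs just a handful of extra launches. The sketch below shows the search logic only: `fits` is a hypothetical callback that you would implement yourself (for example, by starting vLLM in a subprocess with the given max_model_len and reporting whether startup succeeded), and the code assumes that if one length fits, every shorter length fits too.

```python
def largest_fitting_length(fits, low, high, step=1024):
    """Return the largest multiple of `step` in [low, high] where fits() is True.

    `low` and `high` are assumed to be multiples of `step`.
    Returns None if no candidate fits.
    """
    lo, hi = low // step, high // step
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = mid * step
        if fits(candidate):
            best = candidate   # candidate works, try something longer
            lo = mid + 1
        else:
            hi = mid - 1       # candidate fails, try something shorter
    return best

# Stand-in predicate for demonstration: pretend the cache holds 19,744 tokens.
def fake_fits(length):
    return length <= 19744

print(largest_fitting_length(fake_fits, low=4096, high=32768))
```

Because each probe means loading the model once, a coarser `step` trades precision for fewer launches.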
Verify the fix
A clean startup logs the KV cache allocation. Look for the GPU blocks line:
INFO: # GPU blocks: 1234, # CPU blocks: 512
INFO: Avg prompt throughput: 0.0 tokens/s, ...
INFO: Application startup complete.
Each block holds 16 tokens by default. So 1234 blocks × 16 = 19,744 tokens of real capacity. That number must be ≥ your max_model_len, or you'd never have gotten past startup.
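If you want to pull that number out of the logs automatically, a few lines of Python will do. This is a small sketch that assumes the log line looks like the sample shown here; the exact wording can differ between vLLM versions, and the function name is made up for illustration.

```python
import re

# Matches the block count in a line such as "# GPU blocks: 1234, # CPU blocks: 512".
GPU_BLOCKS = re.compile(r"# GPU blocks: (\d+)")

def kv_cache_token_capacity(log_line, block_size=16):
    """Return tokens of KV cache capacity implied by a log line, or None."""
    match = GPU_BLOCKS.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) * block_size

line = "INFO: # GPU blocks: 1234, # CPU blocks: 512"
print(kv_cache_token_capacity(line))
```

Pass a different `block_size` if you started the server with a non-default block size.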
Confirm inference actually works end-to-end:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "Hello, world!",
"max_tokens": 50
}'
Summary
- Fastest fix: add --max-model-len 8192 (or lower) to your startup command
- Need more context? Try --gpu-memory-utilization 0.95 first, then clear VRAM
- Large models on tight VRAM: use --tensor-parallel-size 2 across multiple GPUs
- Always run nvidia-smi before starting vLLM; stale processes waste your time

