Fixing 'CUDA Out of Memory' in Ollama: A Guide to Running Local LLMs on Modest Hardware

The Error Message

You’ve finally downloaded a powerful new model, but as soon as you try to run it, Ollama crashes. Instead of a chat prompt, you see a wall of text in your terminal ending with something like this (the exact wording varies between Ollama versions):

CUDA out of memory. Tried to allocate [X] MiB (GPU 0; [Y] GiB total capacity; [Z] MiB already allocated; [W] MiB free; ...)

Why Your GPU is Screaming

Ollama tries to cram the entire Large Language Model (LLM) into your GPU's Video RAM (VRAM) to give you fast responses. If the model weights plus the "brain space" needed for the conversation (the context window) exceed your free VRAM, the CUDA memory allocation fails and the model never finishes loading. This is a common bottleneck when trying to run a 70B model on an 8GB card, or when your browser is already hogging GPU resources.

Step 1: Identify VRAM Resource Hogs

Before you start tweaking Ollama, see what else is eating your memory. Open your terminal and run:

nvidia-smi

Check the "Processes" table at the bottom. Modern browsers like Chrome or Edge often use 500MB to 1.5GB of VRAM just to render tabs. Similarly, apps like Discord or Slack use hardware acceleration by default. Close these memory-heavy applications and try running your model again to see if that clears the bottleneck.
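To put a number on the headroom you have left after closing those apps, you can ask nvidia-smi for machine-readable output. The sketch below assumes nvidia-smi is on your PATH and relies on its --query-gpu CSV mode; the helper names (parse_gpu_memory, free_vram_mib, query_free_vram) are made up for this example.

```python
import subprocess

# Ask for used and total memory per GPU as bare numbers in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Turn lines like '1234, 8192' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def free_vram_mib(csv_text):
    """Free memory per GPU in MiB, computed as total minus used."""
    return [total - used for used, total in parse_gpu_memory(csv_text)]

def query_free_vram():
    """Run nvidia-smi and return free MiB per GPU (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return free_vram_mib(result.stdout)
```

Call query_free_vram() before and after closing your browser to see how much memory you actually got back.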

Step 2: Switch to a Leaner Quantization

Models are published at different compression levels called quantizations, which store each weight with fewer bits. A standard 8-bit (Q8_0) version of an 8B model might require 9GB of VRAM, while a 4-bit (Q4_K_M) version only needs about 5.5GB. Ollama pulls a 4-bit build by default for most models, but you can go even lower if you're desperate for space.

Try pulling a highly compressed version of your model:

# Try a 2-bit quantization for maximum memory savings
ollama run llama3:8b-instruct-q2_K

While a Q2 or Q3 model uses significantly less VRAM, keep in mind that the AI will likely become noticeably more prone to errors or "hallucinations" because of the heavy compression.
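If you want a rough feel for how much each quantization saves, the arithmetic is just parameters times bits per weight. The sketch below is an estimate only: the bits-per-weight figures are approximate values assumed for illustration (real files vary by model), and the result covers the weights alone, which is why the VRAM numbers quoted above are higher.

```python
GIB = 2**30

# Approximate effective bits per weight. Assumed values for illustration;
# real model files differ somewhat from one model to the next.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85, "Q2_K": 3.0}

def weight_size_gib(n_params, quant):
    """Approximate size of the weights alone (no context window, no buffers)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / GIB

# Compare the three quantizations for an 8-billion-parameter model.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_size_gib(8e9, quant):.1f} GiB of weights")
```

Add the context window and some working buffers on top of these figures before comparing them with your free VRAM.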

Step 3: Shrink the Context Window

The context window is how much text the model can "remember" during a conversation. By default, Ollama often allocates enough VRAM for 2048 or 4096 tokens. On some models, like Llama 3, increasing this to 8192 tokens can eat up an extra 1-2GB of VRAM. Reducing this limit is the fastest way to stop OOM errors without changing the model itself.
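The reason context costs memory is the key/value cache: for every token, the model stores a key and a value vector per layer. A minimal sketch of that arithmetic follows, assuming the published Llama 3 8B shapes (32 layers, 8 key/value heads, head size 128) and a 16-bit cache. Those defaults are assumptions, so swap in your own model's numbers.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Memory for keys and values across all layers for n_ctx tokens.

    Defaults are assumed Llama 3 8B shapes with a 16-bit cache.
    """
    # The leading 2 accounts for storing both a key and a value per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx

for n_ctx in (1024, 2048, 4096, 8192):
    print(f"num_ctx={n_ctx}: {kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
```

With these assumed shapes each token costs 128 KiB, so 8192 tokens come to 1 GiB while 1024 tokens need only 128 MiB. Ollama also allocates working buffers on top of this, so treat the result as a lower bound.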

Create a custom Modelfile to cap the context size. Save the following in a file named config.Modelfile:

FROM llama3
PARAMETER num_ctx 1024

Build your optimized model variation:

ollama create llama3-low-vram -f config.Modelfile

Run your new version:

ollama run llama3-low-vram
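If you talk to Ollama through its local REST API rather than the CLI, the same num_ctx limit can be sent per request in the options field, with no Modelfile needed. This is a minimal sketch, assuming Ollama is listening on its default address (localhost:11434); the function names are invented for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address

def build_generate_request(model, prompt, num_ctx):
    """Build a non-streaming generate request that caps the context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # same parameter as in the Modelfile
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, num_ctx=1024):
    """Send the request and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_generate_request(model, prompt, num_ctx)) as resp:
        return json.load(resp)["response"]
```

For example, generate("llama3", "Say hello", num_ctx=1024) should behave like the llama3-low-vram model built above.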

Step 4: Force Ollama to Clear Memory

By default, Ollama keeps models loaded in your VRAM for 5 minutes after your last prompt. This makes subsequent chats start instantly, but it blocks other apps from using the GPU. If you are frequently switching between models, you can tell Ollama to release the GPU immediately or limit how many models it holds at once.

On Linux:

Modify the systemd service configuration:

sudo systemctl edit ollama.service

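In the override file that opens, add these environment variables under a [Service] header (type the header yourself if the file is empty):

[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=0"

OLLAMA_MAX_LOADED_MODELS=1 keeps only one model in memory at a time, OLLAMA_NUM_PARALLEL=1 handles one request at a time per model, and OLLAMA_KEEP_ALIVE=0 unloads the model as soon as it finishes responding.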

Apply the changes and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

On Windows:

Search for "Edit the system environment variables" in your Start menu.
Click "Environment Variables" and look under "User variables."
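Add a New variable: OLLAMA_MAX_LOADED_MODELS with a value of 1. To match the Linux settings, add OLLAMA_NUM_PARALLEL (value 1) and OLLAMA_KEEP_ALIVE (value 0) the same way.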
Right-click the Ollama icon in your system tray and select "Quit," then relaunch it.
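On either operating system, you can also evict a single model on demand without touching any global settings: a generate request with no prompt and keep_alive set to 0 asks Ollama to unload that model. The sketch below assumes the default local address, and the helper names are hypothetical.

```python
import json
import urllib.request

def build_unload_request(model, base_url="http://localhost:11434"):
    """Build a request with no prompt and keep_alive=0, which unloads the model."""
    payload = {"model": model, "keep_alive": 0}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def unload(model):
    """Send the unload request; returns True on HTTP 200 (needs a running Ollama)."""
    with urllib.request.urlopen(build_unload_request(model)) as resp:
        return resp.status == 200
```

Calling unload("llama3") right before launching another GPU-heavy app frees the VRAM without waiting for the keep-alive timer.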

Step 5: Manually Offload Layers

If your model is only slightly too large for your GPU, Ollama will usually try to split its layers between your GPU and CPU automatically. If this automatic process fails, you can manually control how many model layers go to the GPU. This keeps the bulk of the work on your graphics card while offloading the excess to your system RAM.

Update your Modelfile with the num_gpu parameter:

FROM llama3
# Try sending only 20 layers to the GPU instead of the full model
PARAMETER num_gpu 20

Rebuild the model with ollama create after each change, just as in Step 3. Start with a low number and work your way up. It’s a balancing act: more GPU layers mean faster speeds, but fewer layers prevent the OOM crash.
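Working your way up one layer at a time gets tedious, so here is a sketch of a faster variation: a binary search over the layer count. It is not something Ollama provides. The fits callback is hypothetical and stands in for "build the model with this num_gpu and see whether it loads", and the search assumes that if a given number of layers fits, any smaller number fits too.

```python
def largest_fitting_layers(total_layers, fits):
    """Return the largest n in [0, total_layers] for which fits(n) is True.

    Assumes fits is monotonic and that 0 layers (CPU only) always works.
    """
    lo, hi = 0, total_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid       # mid layers load fine, try more
        else:
            hi = mid - 1   # mid layers crash, try fewer
    return lo

# Toy stand-in for a real load test, with made-up numbers:
# 600 MiB of fixed overhead, 150 MiB per layer, 4000 MiB of free VRAM.
def toy_fits(n_layers):
    return 600 + n_layers * 150 <= 4000

print(largest_fitting_layers(33, toy_fits))
```

With the toy numbers this settles on 22 layers after five trials instead of more than twenty.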

How to Verify the Fix

Don't just guess whether it's working. Follow these steps to see your memory usage in real time:

Flush the memory by running ollama stop [model-name].
Open a new terminal window and run watch -n 1 nvidia-smi.
In your original terminal, start the model you are testing, for example: ollama run llama3-low-vram.
Watch the Memory-Usage column. If it stays safely below your GPU's total capacity (e.g., 7200MiB used out of 8192MiB), you’ve successfully optimized your setup.
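nvidia-smi tells you how full the card is, but not whether part of the model quietly spilled over to the CPU. Ollama's /api/ps endpoint (the data behind ollama ps) reports both the total size of each loaded model and how much of it sits in VRAM. The sketch below assumes the default local address, and the helper names are made up for this example.

```python
import json
import urllib.request

def gpu_fraction(model_entry):
    """Share of a loaded model's memory that sits in VRAM (1.0 means fully on GPU)."""
    size = model_entry["size"]
    return model_entry.get("size_vram", 0) / size if size else 0.0

def report(models):
    """One human-readable line per loaded model."""
    return [f"{m['name']}: {gpu_fraction(m):.0%} on GPU" for m in models]

def loaded_models(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models (needs a running Ollama)."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)["models"]
```

Run print(report(loaded_models())) while the model is loaded. Anything below 100% means some layers are running on the CPU, which avoids the crash but costs speed.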

Pro Tips for Better VRAM Management

Update Drivers: Keep your NVIDIA driver reasonably current (version 535 or higher is a sensible baseline). Newer drivers tend to handle memory pressure more gracefully.
Know Your Limits: If you have 8GB of VRAM, stick to 7B or 8B models at 4-bit quantization. For 14B models, you really need 12GB+ of VRAM (like an RTX 3060 12GB) to run them comfortably.
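Kill Ghost Processes: Sometimes Ollama doesn't shut down cleanly. Use pkill ollama on Linux (or sudo systemctl restart ollama if it runs as a systemd service, since the service may simply respawn after being killed) or end the ollama.exe process in Task Manager to start with a completely empty GPU.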