Fix Whisper RuntimeError: CUDA error: out of memory

The scenario

You're running Whisper on a long audio file — or just loading a large model — and it crashes mid-transcription:

RuntimeError: CUDA error: out of memory
Exception raised from malloc at ../aten/src/ATen/native/cuda/CachingHostAllocator.cpp:152

Sometimes it dies on load. Sometimes it runs for 30 seconds, then bails. Either way, the GPU ran out of VRAM trying to allocate tensors for the model or audio chunks.

Why this happens

Whisper loads the entire model into VRAM upfront. The memory requirements vary a lot by model size:

tiny — ~1 GB VRAM
base — ~1 GB VRAM
small — ~2 GB VRAM
medium — ~5 GB VRAM
large / large-v2 / large-v3 — ~10 GB VRAM

On top of the model weights, Whisper allocates extra buffers for audio processing. A browser tab with GPU acceleration, another training script running in the background, or even a game can quietly eat 1–3 GB of VRAM before your script even starts. That's often enough to push things over the edge.
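To turn the table above into a quick decision, here is a minimal sketch that picks the largest model fitting in a given amount of free VRAM. The helper name `largest_model_that_fits` and the 1 GB headroom default are my own assumptions, not part of Whisper, and the numbers are the rough estimates from the table, not measurements.

```python
# Approximate VRAM needs in GB, copied from the table above (largest first).
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def largest_model_that_fits(free_gb, headroom_gb=1.0):
    """Return the largest model whose estimate plus headroom fits, else None.

    headroom_gb is an assumed safety margin for audio buffers and other
    processes; tune it for your machine.
    """
    for name, need_gb in MODEL_VRAM_GB:
        if need_gb + headroom_gb <= free_gb:
            return name
    return None


print(largest_model_that_fits(8.0))   # medium
print(largest_model_that_fits(24.0))  # large-v3
print(largest_model_that_fits(1.5))   # None
```

If PyTorch with CUDA is installed, `torch.cuda.mem_get_info()` returns free and total bytes, so `free_gb` can come from its first value divided by `1024**3`.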

Start here — check what's actually using your GPU

Don't touch your code yet. First, see the real VRAM picture:

nvidia-smi

Check the Memory-Usage column. If something else is hogging VRAM, kill it and retry — that might be the whole problem.

Also worth checking for zombie Python processes that didn't release the GPU properly:

nvidia-smi | grep python

kill -9 <PID>
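If you'd rather do this check from a script than read the table by eye, nvidia-smi can print per-process usage as CSV. A rough sketch, assuming the `--query-compute-apps` and `--format=csv,noheader,nounits` options are available in your driver's nvidia-smi; `parse_compute_apps` and `gpu_processes` are hypothetical helper names, and the sample text is made up for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_memory' lines into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        rows.append({
            "pid": int(pid),
            "name": name.strip(),
            # Some setups report a non-numeric value here; keep None then.
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


def gpu_processes():
    """Run nvidia-smi and return the processes currently holding VRAM."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


# Made-up sample output, so the parser can be tried without a GPU.
sample = "4242, python, 5120\n1337, /usr/bin/python3, 812\n"
for row in parse_compute_apps(sample):
    print(row)
```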

Fix 1 — Drop to a smaller model

The simplest fix. Running large on a 6 GB card? Drop one size:

import whisper

model = whisper.load_model("medium")  # or "small" if still OOM
result = model.transcribe("audio.mp3")
print(result["text"])

medium hits a strong accuracy-to-VRAM ratio for most tasks. For English-only audio, small performs surprisingly well — often indistinguishable from medium on clear recordings.
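If you don't know in advance which size will fit, you can automate the step-down. This is a minimal sketch, not something the whisper library provides: `load_first_that_fits` is a hypothetical helper, and it assumes the out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory", as in the traceback above. A fake loader stands in for `whisper.load_model` so the logic can be tried without a GPU.

```python
def load_first_that_fits(names, loader):
    """Try each model name in order; return (name, model) for the first that loads."""
    for name in names:
        try:
            return name, loader(name)
        except RuntimeError as err:
            # Only swallow out-of-memory failures; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
            print(f"{name}: out of memory, trying the next size down")
    raise RuntimeError("no candidate model fit in GPU memory")


def fake_loader(name):
    """Stand-in for whisper.load_model that pretends large-v3 does not fit."""
    if name == "large-v3":
        raise RuntimeError("CUDA error: out of memory")
    return f"<{name} model>"


name, model = load_first_that_fits(["large-v3", "medium", "small"], fake_loader)
print(name)  # medium
```

With the real library, the call would be `load_first_that_fits(["large-v3", "medium", "small"], whisper.load_model)`. A failed attempt can leave cached blocks behind, so calling `torch.cuda.empty_cache()` between attempts may help.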

Fix 2 — Fall back to CPU

No VRAM, no problem. Slower, but it sidesteps the GPU limit entirely, as long as the model fits in system RAM:

import whisper

model = whisper.load_model("large-v3", device="cpu")
result = model.transcribe("audio.mp3", fp16=False)  # fp16 isn't supported on CPU; this avoids the fallback warning
print(result["text"])

Expect CPU transcription to run 5–20x slower than GPU. A 10-minute audio file that takes 30 seconds on a GPU might take anywhere from 2.5 to 10 minutes on CPU. Worth it for one-off jobs where you specifically need the accuracy of the large model and can't change the GPU situation.

Fix 3 — Clear cached GPU memory before loading

Running Whisper inside a larger script that already has PyTorch tensors in memory? Clear the allocator cache first:

import torch
import whisper

torch.cuda.empty_cache()

model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3")

del model
torch.cuda.empty_cache()

One caveat: empty_cache() only releases the PyTorch allocator's free blocks. It won't reclaim memory held by live tensors. But if a previous model left fragmented cache behind, this often frees enough headroom to load Whisper.
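To see that caveat in numbers, compare what PyTorch has allocated (live tensors) with what it has reserved (live tensors plus cached free blocks). A small sketch, assuming PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`; the helper names are hypothetical, and the import guard is only there so the snippet also runs on a machine without PyTorch or a GPU.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None

GIB = 1024 ** 3


def cuda_memory_gib():
    """Return (allocated, reserved) in GiB, or None when CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    return (
        torch.cuda.memory_allocated() / GIB,  # held by live tensors
        torch.cuda.memory_reserved() / GIB,   # live tensors + cached blocks
    )


def release_cached_memory():
    """Collect unreachable Python objects, then hand cached blocks back to the driver."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return cuda_memory_gib()


before = cuda_memory_gib()
after = release_cached_memory()
print("before:", before)
print("after: ", after)
```

If the reserved figure drops while the allocated figure stays put, the cache was the problem. If allocated is still high, something is holding references to tensors and needs a `del` first.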

Fix 4 — Enable fp16 to cut VRAM usage in half

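Half-precision inference is already the default on CUDA, so passing it explicitly won't free extra memory on its own. It does guard against a wrapper or config that quietly set fp16=False: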

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", fp16=True)
print(result["text"])

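Compared with running in fp32 (fp16=False), half precision uses substantially less VRAM with minimal accuracy impact. One gotcha: if you're running on CPU, set fp16=False. Whisper can't run fp16 on CPU, so it falls back to fp32 and prints a warning ("FP16 is not supported on CPU; using FP32 instead") that clutters the output.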

Fix 5 — Switch to faster-whisper (the practical solution for large models)

faster-whisper is a CTranslate2-based reimplementation of Whisper. Same models, fraction of the VRAM:

pip install faster-whisper

from faster_whisper import WhisperModel

# int8 quantization — lowest VRAM, still solid quality
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

With compute_type="int8", large-v3 drops from ~10 GB to around 3–4 GB VRAM. That's the difference between OOM and a smooth run on a 6 GB card.

Your compute_type options:

float16 — default GPU mode, ~half the fp32 footprint
int8_float16 — good balance of speed and accuracy
int8 — lowest VRAM, slight accuracy trade-off (usually unnoticeable)
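Why the compute type matters so much comes down to bytes per weight. Here is a back-of-the-envelope sketch, assuming the parameter counts listed in the openai/whisper README model table (they are not from this article) and counting weights only. Real usage is higher because of activations and decoding buffers, which is in line with the 3–4 GB figure above for int8.

```python
# Parameter counts in millions, as listed in the openai/whisper README (assumed).
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Storage cost of a single weight for each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_gib(model, dtype):
    """Estimate the size of the model weights alone, in GiB."""
    return PARAMS_MILLIONS[model] * 1e6 * BYTES_PER_WEIGHT[dtype] / 1024 ** 3


for dtype in ("float32", "float16", "int8"):
    print(f"large @ {dtype}: {weight_gib('large', dtype):.2f} GiB of weights")
```

Each step down in precision halves the weight footprint, so int8 weights take about a quarter of the space of fp32 weights.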

Fix 6 — Chunk long audio files

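Very long recordings can cause peak allocations that tip you over the limit even when the model loads fine. Splitting into 10-minute chunks keeps memory usage flat. Fixed-length cuts can split a word at a boundary, so expect the occasional glitch where two chunks meet: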

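import os
import tempfile

import whisper
from pydub import AudioSegment

model = whisper.load_model("medium")
audio = AudioSegment.from_file("long_audio.mp3")

chunk_length_ms = 10 * 60 * 1000  # 10 minutes
# pydub slices by milliseconds; the last chunk may be shorter
chunks = [audio[i:i + chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]

full_text = []
# A temporary directory is portable (no hard-coded /tmp) and is removed even if transcription fails
with tempfile.TemporaryDirectory() as tmp_dir:
    for i, chunk in enumerate(chunks):
        chunk_path = os.path.join(tmp_dir, f"chunk_{i}.wav")
        chunk.export(chunk_path, format="wav")  # WAV avoids a second lossy MP3 encode
        result = model.transcribe(chunk_path)
        full_text.append(result["text"].strip())

print(" ".join(full_text))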
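One side effect of chunking: if you use result["segments"] rather than just the text, the timestamps restart at zero in every chunk. A minimal sketch for shifting them back onto the original timeline, assuming each segment is a dict with "start" and "end" in seconds, as the openai-whisper library returns them. `offset_segments` is a hypothetical helper and the sample segment is made up.

```python
def offset_segments(segments, offset_s):
    """Return copies of the segments with start/end shifted by offset_s seconds."""
    return [
        {**seg, "start": seg["start"] + offset_s, "end": seg["end"] + offset_s}
        for seg in segments
    ]


chunk_length_s = 10 * 60  # must match the chunk length used for splitting

# Made-up segment from the second chunk (index 1), timed relative to that chunk.
second_chunk = [{"start": 0.0, "end": 4.5, "text": " Welcome back."}]
shifted = offset_segments(second_chunk, 1 * chunk_length_s)
print(shifted[0]["start"], shifted[0]["end"])  # 600.0 604.5
```

Inside the chunk loop, the offset for chunk `i` would be `i * chunk_length_ms / 1000`.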

Verify it's actually fixed

Watch VRAM in real time while Whisper runs:

# In a separate terminal
watch -n 1 nvidia-smi

You want to see VRAM climb when the model loads, then hold steady during transcription. A spike-then-crash pattern means you're still over the limit.

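After a successful run, check peak usage from Python. Run this in the same process as the transcription: the counters only cover PyTorch allocations made by that process, so they read zero in a fresh interpreter and don't reflect faster-whisper, which allocates through CTranslate2 rather than PyTorch: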

import torch
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
print(f"Currently allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

What I actually use

On an 8 GB card, my default is faster-whisper with large-v3 + int8. It fits comfortably, runs fast, and the quality is nearly identical to the original large model. I only fall back to the original whisper library when a specific integration requires it.

Stuck with the original library and a small GPU? medium + fp16=True is the sweet spot.