The scenario
You're running Whisper on a long audio file (or just loading a large model) and it crashes mid-transcription:
RuntimeError: CUDA error: out of memory
Exception raised from malloc at ../aten/src/ATen/native/cuda/CachingHostAllocator.cpp:152
Sometimes it dies on load. Sometimes it runs for 30 seconds, then bails. Either way, the GPU ran out of VRAM trying to allocate tensors for the model or audio chunks.
Why this happens
Whisper loads the entire model into VRAM upfront. The memory requirements vary a lot by model size:
tiny: ~1 GB VRAM
base: ~1 GB VRAM
small: ~2 GB VRAM
medium: ~5 GB VRAM
large / large-v2 / large-v3: ~10 GB VRAM
On top of the model weights, Whisper allocates extra buffers for audio processing. A browser tab with GPU acceleration, another training script running in the background, or even a game can quietly eat 1-3 GB of VRAM before your script even starts. That's often enough to push things over the edge.
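As a rough illustration of how the numbers above interact, here is a minimal sketch that picks the largest model fitting in a given amount of free VRAM. The figures are the approximate ones from the list above; the function name pick_model and the 1 GB default headroom are assumptions made for this example, not part of Whisper.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the list above (smallest to largest).
MODEL_VRAM_GB = {
    "tiny": 1.0,
    "base": 1.0,
    "small": 2.0,
    "medium": 5.0,
    "large-v3": 10.0,
}


def pick_model(free_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest model whose estimate fits in free_gb minus headroom_gb.

    headroom_gb is a hypothetical safety margin for audio buffers and other
    processes; tune it for your own machine.
    """
    budget = free_gb - headroom_gb
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= budget]
    # The dict is ordered smallest to largest, so the last fit is the biggest.
    return fitting[-1] if fitting else None


print(pick_model(6.0))   # medium
print(pick_model(12.0))  # large-v3
print(pick_model(1.5))   # None
```

If PyTorch is installed, torch.cuda.mem_get_info() returns the free and total bytes on the current device, so dividing the first value by 1024**3 gives a usable free_gb.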
Start here: check what's actually using your GPU
Don't touch your code yet. First, see the real VRAM picture:
nvidia-smi
Check the Memory-Usage column. If something else is hogging VRAM, kill it and retry. That might be the whole problem.
Also worth checking for zombie Python processes that didn't release the GPU properly:
nvidia-smi | grep python
kill -9 <PID>
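If you'd rather do this check from Python, the sketch below parses nvidia-smi's per-process CSV output and sorts processes by VRAM use. It assumes nvidia-smi is on your PATH; the helper names are made up for this example, and the sample rows in the usage note are invented.

```python
import csv
import io
import subprocess

# Per-process query; with "nounits" the memory column is a bare number in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into dicts, largest first."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used_mib = (field.strip() for field in fields)
        try:
            rows.append({"pid": int(pid), "name": name, "used_mib": int(used_mib)})
        except ValueError:
            continue  # e.g. memory reported as "[N/A]"
    return sorted(rows, key=lambda row: row["used_mib"], reverse=True)


def list_gpu_processes():
    """Run nvidia-smi and return the parsed process list."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling list_gpu_processes() on a machine with an NVIDIA driver returns something like [{"pid": 1234, "name": "python", "used_mib": 5120}, ...], which makes the biggest VRAM consumer easy to spot before you reach for kill.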
Fix 1: Drop to a smaller model
The simplest fix. Running large on a 6 GB card? Drop one size:
import whisper
model = whisper.load_model("medium") # or "small" if still OOM
result = model.transcribe("audio.mp3")
print(result["text"])
medium hits a strong accuracy-to-VRAM ratio for most tasks. For English-only audio, small performs surprisingly well, often indistinguishable from medium on clear recordings.
Fix 2: Fall back to CPU
No VRAM, no problem. Slower, but it will always finish:
import whisper
model = whisper.load_model("large-v3", device="cpu")
result = model.transcribe("audio.mp3")
print(result["text"])
Expect CPU transcription to run 5-20x slower than GPU. A 10-minute audio file that takes 30 seconds on a GPU might take 8-10 minutes on CPU. Worth it for one-off jobs where you specifically need large accuracy and can't change the GPU situation.
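If you want the fallback to happen automatically, one option is to try a list of (model, device) pairs in order and move on whenever loading fails with an out-of-memory error. This is only a sketch: is_oom and load_with_fallback are hypothetical helpers, the OOM check is a string-matching heuristic, and it only covers failures during loading, not ones that show up mid-transcription.

```python
def is_oom(exc: BaseException) -> bool:
    """Heuristic: treat RuntimeErrors that mention 'out of memory' as OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def load_with_fallback(load, candidates):
    """Try each (model_name, device) pair in order.

    `load` is any callable taking (model_name, device). Returns a tuple of
    (model, model_name, device) for the first pair that loads.
    """
    last_error = None
    for name, device in candidates:
        try:
            return load(name, device), name, device
        except RuntimeError as exc:
            if not is_oom(exc):
                raise  # unrelated failure: do not hide it
            last_error = exc
    raise RuntimeError("no candidate fit in memory") from last_error


# Usage with the original library (commented out so the sketch runs anywhere):
# import whisper
# model, name, device = load_with_fallback(
#     lambda n, d: whisper.load_model(n, device=d),
#     [("large-v3", "cuda"), ("medium", "cuda"), ("large-v3", "cpu")],
# )
```

The order of the candidate list is the design choice here: put a smaller GPU model before the CPU entry if speed matters more than accuracy, or the other way round if it doesn't.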
Fix 3: Clear cached GPU memory before loading
Running Whisper inside a larger script that already has PyTorch tensors in memory? Clear the allocator cache first:
import torch
import whisper
torch.cuda.empty_cache()
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3")
del model
torch.cuda.empty_cache()
One caveat: empty_cache() only releases the PyTorch allocator's free blocks. It won't reclaim memory held by live tensors. But if a previous model left fragmented cache behind, this often frees enough headroom to load Whisper.
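To see the difference between live tensors and cached blocks for yourself, a small helper along these lines can report both numbers after a cleanup. The function name release_gpu_cache is an assumption for this sketch; it returns None on machines without PyTorch or CUDA so it can be dropped into any script.

```python
import gc


def release_gpu_cache():
    """Collect garbage, then hand cached blocks back to the driver.

    Returns (allocated_bytes, reserved_bytes) after the cleanup, or None when
    PyTorch or CUDA is unavailable.
    """
    gc.collect()  # drop unreachable tensors first so their blocks become free
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    # allocated = memory held by live tensors; reserved = allocated + cache
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


stats = release_gpu_cache()
if stats is not None:
    allocated, reserved = stats
    print(f"Live tensors: {allocated / 1024**3:.2f} GB, reserved: {reserved / 1024**3:.2f} GB")
```

If the first number stays high after the call, something still holds a reference to the old model, and you need to del it before empty_cache() can help.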
Fix 4: Enable fp16 to cut VRAM usage in half
Half-precision inference is on by default for CUDA, but making it explicit doesn't hurt:
import whisper
model = whisper.load_model("large-v3")
result = model.transcribe("audio.mp3", fp16=True)
print(result["text"])
This nearly halves VRAM usage with minimal accuracy impact. One gotcha: if you're running on CPU, set fp16=False. CPU doesn't support fp16 and you'll get a noisy warning that buries the actual output.
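If the same script runs on both GPU and CPU machines, deriving the flag from the device avoids the warning. A tiny sketch, with fp16_for as a made-up helper name; it deliberately treats anything other than CUDA as fp32, which is an assumption you may want to revisit for other backends.

```python
def fp16_for(device: str) -> bool:
    """Use half precision only on CUDA devices such as 'cuda' or 'cuda:0'."""
    return device.split(":")[0] == "cuda"


print(fp16_for("cuda"))    # True
print(fp16_for("cuda:0"))  # True
print(fp16_for("cpu"))     # False

# Usage with the original library (commented out so the sketch runs anywhere):
# result = model.transcribe("audio.mp3", fp16=fp16_for(device))
```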
Fix 5: Switch to faster-whisper (the practical solution for large models)
faster-whisper is a CTranslate2-based reimplementation of Whisper. Same models, fraction of the VRAM:
pip install faster-whisper
from faster_whisper import WhisperModel
# int8 quantization: lowest VRAM, still solid quality
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
With compute_type="int8", large-v3 drops from ~10 GB to around 3-4 GB VRAM. That's the difference between OOM and a smooth run on a 6 GB card.
Your compute_type options:
float16: default GPU mode, ~half the fp32 footprint
int8_float16: good balance of speed and accuracy
int8: lowest VRAM, slight accuracy trade-off (usually unnoticeable)
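A back-of-the-envelope check on why compute_type matters so much: weights-only memory is roughly parameter count times bytes per weight. The parameter counts below are the ones listed in the Whisper README (large is about 1550 M); weights_gib is a made-up helper, and real usage will be higher because activations and decoding buffers come on top of the weights.

```python
# Parameter counts in millions, as listed in the Whisper README.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Storage size of one weight for each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weights_gib(model: str, compute_type: str) -> float:
    """Weights-only estimate in GiB; ignores activations and runtime buffers."""
    return PARAMS_MILLIONS[model] * 1e6 * BYTES_PER_WEIGHT[compute_type] / 1024**3


for compute_type in BYTES_PER_WEIGHT:
    print(f"large, {compute_type}: {weights_gib('large', compute_type):.2f} GiB")
# large, float32: 5.77 GiB
# large, float16: 2.89 GiB
# large, int8: 1.44 GiB
```

The gap between these weights-only figures and the measured totals is the runtime overhead, which is why it's worth confirming with nvidia-smi rather than trusting the arithmetic alone.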
Fix 6: Chunk long audio files
Very long recordings cause peak allocations that tip you over the limit even when the model loads fine. Splitting into 10-minute chunks keeps memory usage flat:
from pydub import AudioSegment
import whisper
import os
model = whisper.load_model("medium")
audio = AudioSegment.from_file("long_audio.mp3")
chunk_length_ms = 10 * 60 * 1000 # 10 minutes
chunks = [audio[i:i+chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]
full_text = []
for i, chunk in enumerate(chunks):
    chunk_path = f"/tmp/chunk_{i}.mp3"
    chunk.export(chunk_path, format="mp3")
    result = model.transcribe(chunk_path)
    full_text.append(result["text"])
    os.remove(chunk_path)
print(" ".join(full_text))
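One thing the loop above loses is timing: each chunk's segments restart at zero. If you need timestamps for the whole recording, a sketch like the following shifts them by the chunk's offset. Both helper names are hypothetical; the only thing assumed about Whisper is that result["segments"] is a list of dicts with "start" and "end" in seconds.

```python
def chunk_bounds(total_ms: int, chunk_ms: int):
    """Return (start_ms, end_ms) pairs covering [0, total_ms) without overlap."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def shift_segments(segments, offset_s: float):
    """Copy segments, moving their start/end times (seconds) by offset_s."""
    return [
        {**seg, "start": seg["start"] + offset_s, "end": seg["end"] + offset_s}
        for seg in segments
    ]


# A 25-minute file cut into 10-minute chunks:
print(chunk_bounds(25 * 60 * 1000, 10 * 60 * 1000))
# [(0, 600000), (600000, 1200000), (1200000, 1500000)]

# Segments from the second chunk, shifted by its 600-second offset:
print(shift_segments([{"start": 1.5, "end": 4.0, "text": " hello"}], 600.0))
# [{'start': 601.5, 'end': 604.0, 'text': ' hello'}]
```

Keep in mind that fixed-length cuts can land in the middle of a word, so the text around chunk boundaries may need a second look.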
Verify it's actually fixed
Watch VRAM in real time while Whisper runs:
# In a separate terminal
watch -n 1 nvidia-smi
You want to see VRAM climb when the model loads, then hold steady during transcription. A spike-then-crash pattern means you're still over the limit.
After a successful run, check peak usage from Python:
import torch
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
print(f"Currently allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
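The torch counters only see memory that PyTorch itself allocated. For the device-wide picture that the watch command shows, you can sample nvidia-smi from Python instead. A minimal sketch, assuming nvidia-smi is on your PATH; the helper names and the sample numbers in the usage note are made up for illustration.

```python
import subprocess

# Device-wide query; with "nounits" both columns are bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' rows (one per GPU) into (used_mib, total_mib) tuples."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part) for part in line.split(","))
        pairs.append((used, total))
    return pairs


def headroom_mib(pairs):
    """Free MiB per GPU, in the same order as the input pairs."""
    return [total - used for used, total in pairs]


def sample_memory():
    """Run nvidia-smi once and return the parsed (used, total) pairs."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)
```

For example, parse_memory("5120, 8192") gives [(5120, 8192)], and headroom_mib on that result gives [3072]. Calling sample_memory() once a second in a background thread and keeping the maximum is a simple way to record the peak during a run.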
What I actually use
On an 8 GB card, my default is faster-whisper with large-v3 + int8. It fits comfortably, runs fast, and the quality is nearly identical to the original large model. I only fall back to the original whisper library when a specific integration requires it.
Stuck with the original library and a small GPU? medium + fp16=True is the sweet spot.

