Fix 'bitsandbytes was compiled without GPU support' RuntimeError in Quantization

TL;DR Quick Fix

One cause, multiple exits: bitsandbytes couldn't find a CUDA-enabled GPU. Your next move depends on your setup:

GPU present but CUDA not detected → reinstall PyTorch from the matching CUDA index, then reinstall bitsandbytes
CPU-only machine → skip 8-bit entirely and use GGUF/llama.cpp instead
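Windows → use a bitsandbytes build with Windows support (official wheels from 0.43.0 onward, or the bitsandbytes-windows fork for older versions), or use WSL2 with CUDA passthrough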

The Full Error

RuntimeError: bitsandbytes was compiled without GPU support. 8-bit optimizers and quantization require a GPU to function.

Typically triggered by code like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config
)
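
A defensive variant of the snippet above, as a minimal sketch: only request 8-bit loading when PyTorch reports a usable GPU. The helper name load_kwargs is hypothetical and not part of transformers. Note that the CPU branch loads the model unquantized, which is only practical for small models.

```python
def load_kwargs(cuda_available: bool) -> dict:
    """Hypothetical helper: pick from_pretrained() arguments based on GPU availability."""
    if cuda_available:
        # Imported lazily so the CPU branch never touches the quantization classes
        from transformers import BitsAndBytesConfig

        return {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)}
    # No CUDA: skip quantization and keep the model on the CPU
    return {"device_map": "cpu"}


# Intended use (commented out because it downloads a multi-gigabyte model):
# import torch
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-v0.1", **load_kwargs(torch.cuda.is_available())
# )
print(load_kwargs(cuda_available=False))
```

This avoids the RuntimeError on machines without CUDA, but it does not fix a broken GPU setup. If you expected a GPU to be found, keep reading.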

Root Cause

bitsandbytes loads a precompiled CUDA library when it is imported. If it can't find a usable CUDA runtime, it has no GPU kernels to call, and quantization fails with this error.

Three situations lead here:

You installed a CPU-only torch build (the default PyPI wheel on Windows, or any wheel from the CPU index) instead of one from a CUDA-specific index
Your machine has no NVIDIA GPU (cloud CPU instance, Mac, CI/CD server)
A GPU is installed, but the NVIDIA driver is missing or too old for PyTorch's CUDA build, or the CUDA runtime libraries are not on the library search path, so the GPU is not visible

Step 1 — Diagnose Your Setup

Start here before touching anything:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvcc --version
nvidia-smi

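torch.cuda.is_available() returning False is the smoking gun: bitsandbytes won't load GPU kernels no matter how it was compiled, so fix PyTorch first. If nvidia-smi fails, there is no working NVIDIA driver. A missing nvcc is harmless by itself, because PyTorch wheels bundle their own CUDA runtime.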
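
The same checks can be collected from a single Python script. This is a rough sketch: collect_facts and suggest_fix are hypothetical helper names, and the mapping from facts to fixes is a simplification of the sections that follow, not an exhaustive diagnosis.

```python
import shutil


def collect_facts() -> dict:
    """Gather the signals from the commands above without raising."""
    facts = {
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": False,
        "torch_cuda_build": None,
    }
    try:
        import torch  # imported lazily so the script also runs without PyTorch
    except (ImportError, OSError):
        return facts
    facts["torch_installed"] = True
    facts["cuda_available"] = torch.cuda.is_available()
    facts["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    return facts


def suggest_fix(facts: dict) -> str:
    """Map the collected facts to a next step (simplified, hypothetical mapping)."""
    if facts["cuda_available"]:
        return "PyTorch sees the GPU: reinstall bitsandbytes only"
    if not facts["nvidia_smi_found"]:
        return "No NVIDIA driver found: treat this as a CPU-only machine"
    if facts["torch_cuda_build"] is None:
        return "No CUDA-enabled PyTorch: install PyTorch from a CUDA index"
    return "CUDA build but no usable GPU: check the NVIDIA driver version"


if __name__ == "__main__":
    facts = collect_facts()
    print(facts)
    print(suggest_fix(facts))
```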

Fix A — Reinstall bitsandbytes with CUDA (GPU Available)

Got an NVIDIA GPU? The problem is almost always a mismatched PyTorch build. Reinstall in order:

1. Reinstall PyTorch with CUDA support:

# For CUDA 12.1 (nvidia-smi shows the highest CUDA version your driver supports)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2. Reinstall bitsandbytes:

pip uninstall bitsandbytes -y
pip install bitsandbytes

3. Confirm it's working:

python -c "import bitsandbytes as bnb; print(bnb.__version__)"
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
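
The two index URLs in step 1 follow a visible pattern: the CUDA version with the dot removed, prefixed by cu. The sketch below builds the pip command from a version string. The helper names are hypothetical, and the pattern is only an assumption for versions other than the two shown above, since PyTorch publishes wheels for a limited set of CUDA versions.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Turn a CUDA version such as '12.1' into a PyTorch wheel index URL."""
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '12.1', got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    # Assumed pattern, taken from the cu121 and cu118 URLs above
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def pip_command(cuda_version: str) -> str:
    """Build the reinstall command from step 1 for a given CUDA version."""
    return (
        "pip install torch torchvision torchaudio --index-url "
        + pytorch_index_url(cuda_version)
    )


print(pip_command("12.1"))
print(pip_command("11.8"))
```

Check that the resulting index actually exists before relying on it.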

Fix B — Windows-Specific

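bitsandbytes releases before 0.43.0 shipped no Windows binaries. Upgrading to a current release is the first thing to try:

pip install -U bitsandbytes

If you are pinned to an older version, use the community-maintained fork instead:

pip uninstall bitsandbytes -y
pip install bitsandbytes-windows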

For serious LLM work on Windows, WSL2 with CUDA passthrough is more stable long-term:

# Inside WSL2 (Ubuntu)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes

Fix C — CPU-Only Machine (No GPU)

Hard limit: the bitsandbytes 8-bit path used in this guide requires a CUDA GPU, and the releases covered here have no CPU fallback. You have three practical options:

Option 1: llama.cpp / GGUF (recommended for CPU)

This is the go-to for running LLMs on CPU. A Q4_K_M quantized Mistral 7B fits in roughly 6 GB of RAM:

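pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# from_pretrained downloads the GGUF file from the Hub (needs huggingface-hub)
llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=2048  # context window in tokens
)
output = llm("What is quantization?", max_tokens=200)
print(output["choices"][0]["text"])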

Option 2: Hugging Face without quantization

Drop the BitsAndBytesConfig entirely and load at full precision:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float32,
    device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

1–3B models are usable on CPU. A 7B model at float32 needs 28+ GB RAM and will be painfully slow.
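
The 28 GB figure is parameters times bytes per parameter. A quick sketch of the arithmetic, counting weights only (activations and framework overhead come on top, hence "28+"); the function name is hypothetical:

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# A 7B model at three common precisions
for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, nbytes):.0f} GB")

# phi-2 has about 2.7B parameters
print(f"phi-2 @ float32: {weight_memory_gb(2.7e9, 4):.1f} GB")
```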

Option 3: ctransformers (C++ backend)

pip install ctransformers

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral"
)

Fix D — Google Colab / Cloud Notebooks

Getting this error on Colab despite selecting a GPU runtime? First, verify the GPU was actually assigned:

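import shutil, subprocess
# nvidia-smi is missing on CPU runtimes; check first to avoid FileNotFoundError
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("No GPU assigned to this runtime")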

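If no GPU shows up, go to Runtime → Change runtime type → T4 GPU and reconnect. Then force-reinstall bitsandbytes and restart the runtime from the Runtime menu. A restart is needed because importlib.reload does not reliably reload the compiled library that an already-imported bitsandbytes has loaded:

!pip install -q --force-reinstall --no-deps bitsandbytes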

Verification — Confirm the Fix Worked

import torch
import bitsandbytes as bnb

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"bitsandbytes: {bnb.__version__}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Building an 8-bit layer on the GPU exercises the bitsandbytes CUDA library
    layer_8bit = bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False).cuda()
    print("8-bit layer created successfully")

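You're looking for output along these lines (the PyTorch version line is omitted here, and your versions and GPU name will differ):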

CUDA available: True
CUDA version: 12.1
bitsandbytes: 0.43.1
GPU: NVIDIA GeForce RTX 3090
8-bit layer created successfully

Version Compatibility

bitsandbytes >= 0.41.0 → supports CUDA 11.7, 11.8, 12.0, 12.1+
bitsandbytes 0.37–0.40 → CUDA 11.x only, will break on CUDA 12.x
Match bitsandbytes to PyTorch's CUDA version — not the system CUDA version

Check which CUDA PyTorch was actually built against — this is the number that matters for compatibility:

import torch
print(torch.version.cuda)
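
The table above can be turned into a small check. This sketch encodes only the thresholds listed in this section, which are taken from the table and not verified against the bitsandbytes release notes. Both function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """'0.43.1' -> (0, 43, 1); drops local suffixes such as '+cu121'."""
    core = text.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())


def bnb_supports_cuda(bnb_version: str, cuda_version: str) -> bool:
    """Apply the compatibility table from this section (assumed thresholds)."""
    bnb = parse_version(bnb_version)
    cuda = parse_version(cuda_version)
    if bnb >= (0, 41, 0):
        return cuda >= (11, 7)
    if (0, 37) <= bnb[:2] <= (0, 40):
        return cuda[0] == 11  # CUDA 11.x only
    return False  # not covered by the table, treated as unsupported here


print(bnb_supports_cuda("0.43.1", "12.1"))
print(bnb_supports_cuda("0.39.0", "12.1"))
```

In a real environment, pass bnb.__version__ and torch.version.cuda as the two arguments. On a CPU-only PyTorch build torch.version.cuda is None, so check for that first.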