Fixing the 'bitsandbytes was compiled without GPU support' RuntimeError During Quantization

TL;DR Quick Fix

One cause, several fixes: bitsandbytes cannot find a CUDA-capable GPU. The next step depends on your setup:

GPU present but CUDA not detected → reinstall bitsandbytes with the matching CUDA wheel
CPU-only machine → drop 8-bit entirely and use GGUF/llama.cpp instead
Windows → install bitsandbytes-windows, or use WSL2 with CUDA passthrough

The Full Error

RuntimeError: bitsandbytes was compiled without GPU support. 8-bit optimizers and quantization require a GPU to function.

The error typically appears when running code like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config
)

Root Cause

bitsandbytes links against its CUDA kernels at load time. No CUDA runtime, no kernels. It is that simple.

Three situations lead to this error:

You installed a CPU-only build of torch (the default PyPI wheel is CPU-only on some platforms, such as Windows) instead of one from the CUDA-specific index
The machine has no NVIDIA GPU (CPU-only cloud instance, Mac, CI/CD server)
CUDA is installed on the system, but nvcc or the CUDA runtime is not on PATH and PyTorch does not detect it

Step 1: Diagnose Your Setup

Check before doing anything else:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvcc --version
nvidia-smi

If torch.cuda.is_available() returns False, that is the clearest sign. bitsandbytes will not load its GPU kernels no matter how it was compiled, so fix PyTorch first.
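The three checks above can be folded into one small script. This is only a sketch: diagnose and run_diagnosis are hypothetical helper names, the returned labels refer to the fixes in this article, and the mapping reflects the reasoning here rather than anything bitsandbytes does itself. It relies on torch.version.cuda being None for CPU-only PyTorch builds.

```python
import shutil
from typing import Optional


def diagnose(cuda_available: bool, torch_cuda: Optional[str], has_nvidia_smi: bool) -> str:
    """Map the diagnostic signals to the most likely fix (hypothetical helper)."""
    if cuda_available:
        # PyTorch can use the GPU, so the problem is on the bitsandbytes side.
        return "A: reinstall bitsandbytes"
    if torch_cuda is None:
        # torch.version.cuda is None for CPU-only PyTorch builds.
        if has_nvidia_smi:
            return "A: reinstall PyTorch from a CUDA index"
        return "C: no NVIDIA GPU found, use a CPU alternative"
    # CUDA build of PyTorch, but no device is visible to it.
    return "check the NVIDIA driver and nvidia-smi output"


def run_diagnosis() -> str:
    """Collect the signals on this machine and classify them."""
    has_smi = shutil.which("nvidia-smi") is not None
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    return diagnose(torch.cuda.is_available(), torch.version.cuda, has_smi)
```

Call print(run_diagnosis()) in the environment that raised the error. Treat the result as a starting hint, not a verdict.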

Fix A: Reinstall bitsandbytes with CUDA (GPU Present)

Have an NVIDIA GPU? The problem is almost always a mismatched PyTorch build. Reinstall in this order:

1. Reinstall PyTorch with CUDA support:

# For CUDA 12.1 (check your version with: nvcc --version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2. Reinstall bitsandbytes:

pip uninstall bitsandbytes -y
pip install bitsandbytes

3. Confirm it works:

python -c "import bitsandbytes as bnb; print(bnb.__version__)"
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
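The two index URLs in step 1 follow the same pattern: "cu" followed by the CUDA major and minor version with the dot removed. As a minimal sketch, the hypothetical helper below builds the URL from a version string. Only cu121 and cu118 appear in this article; do not assume every CUDA version has a matching index.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Build the wheel index URL for a CUDA version string such as "12.1"."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


# Print the install command for the two versions covered in step 1.
for version in ("12.1", "11.8"):
    print(f"pip install torch torchvision torchaudio --index-url {pytorch_index_url(version)}")
```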

Fix B: Windows Only

bitsandbytes releases before 0.43.0 do not ship Windows builds, so first try upgrading to a current release. If you are stuck on an older version, use the community-maintained fork instead:

pip uninstall bitsandbytes -y
pip install bitsandbytes-windows

For serious LLM work on Windows, WSL2 with CUDA passthrough is more stable in the long run:

# Inside WSL2 (Ubuntu)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes

Fix C: CPU-Only Machine (No GPU)

Hard limit: 8-bit quantization in bitsandbytes requires a CUDA GPU. The library has no CPU fallback. You have three practical options:

Option 1: llama.cpp / GGUF (recommended for CPU)

This is the top choice for running LLMs on CPU. Mistral 7B quantized to Q4_K_M needs only about 6 GB of RAM:

pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=2048
)
output = llm("What is quantization?", max_tokens=200)
print(output["choices"][0]["text"])

Option 2: Hugging Face without quantization

Drop BitsAndBytesConfig entirely and load the model at full precision:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float32,
    device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

Models in the 1–3B range are usable on CPU. A 7B model in float32 needs more than 28 GB of RAM and will run very slowly.
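The 28 GB figure is simple arithmetic: 7 billion parameters times 4 bytes per float32 value. The sketch below reproduces it with a hypothetical helper, weight_memory_gb. It counts the weights only, so activations, the KV cache, and framework overhead come on top, which is why the real requirement is higher. The 2.7B row corresponds to phi-2.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for label, n_params in (("2.7B", 2.7e9), ("7B", 7e9)):
    fp32 = weight_memory_gb(n_params, 4)  # float32: 4 bytes per parameter
    fp16 = weight_memory_gb(n_params, 2)  # float16: 2 bytes per parameter
    print(f"{label}: float32 ~ {fp32:.1f} GB, float16 ~ {fp16:.1f} GB")
```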

Option 3: ctransformers (C++ backend)

pip install ctransformers

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral"
)

Fix D: Google Colab / Cloud Notebooks

Hitting this error on Colab even though you selected a GPU runtime? First, confirm that a GPU was actually allocated:

import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)
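Note that subprocess.run raises FileNotFoundError when nvidia-smi is not installed at all, which is common on CPU runtimes. A slightly more defensive variant, sketched here with a hypothetical gpu_allocated helper, returns a plain boolean instead. It uses nvidia-smi -L, which prints one line per GPU.

```python
import shutil
import subprocess


def gpu_allocated() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # nvidia-smi is not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # "nvidia-smi -L" prints one "GPU <index>: ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU allocated:", gpu_allocated())
```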

If nvidia-smi fails, go to Runtime → Change runtime type → T4 GPU and reconnect. Then reinstall bitsandbytes and reload it:

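!pip install -q -U bitsandbytes
import importlib, bitsandbytes
importlib.reload(bitsandbytes)
# If the error persists, restart the runtime: reload() does not re-import already-loaded submodules.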

Verification: Confirm the Fix Worked

import torch
import bitsandbytes as bnb

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"bitsandbytes: {bnb.__version__}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Moving the layer to the GPU triggers the 8-bit weight quantization
    layer_8bit = bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False).cuda()
    print("8-bit layer created successfully")

After the PyTorch version line, you should see output like this (exact versions and GPU name will differ):

CUDA available: True
CUDA version: 12.1
bitsandbytes: 0.43.1
GPU: NVIDIA GeForce RTX 3090
8-bit layer created successfully

Version Compatibility

bitsandbytes >= 0.41.0 → supports CUDA 11.7, 11.8, 12.0, 12.1+
bitsandbytes 0.37–0.40 → CUDA 11.x only, fails with CUDA 12.x
Match bitsandbytes to the CUDA version PyTorch was built with, not the system CUDA version

Check which CUDA version PyTorch was actually built with. That is the number that matters for compatibility:

import torch
print(torch.version.cuda)
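The compatibility rules above can be encoded as a quick pre-install check. This is a sketch that mirrors only the two rows listed in this article; bnb_supports_cuda is a hypothetical helper, and it returns None for versions the article does not cover rather than guessing.

```python
from typing import Optional, Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn "0.43.1" or "12.1" into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def bnb_supports_cuda(bnb_version: str, cuda_version: str) -> Optional[bool]:
    """Apply the article's compatibility rows; None means "not covered here"."""
    bnb, cuda = _parse(bnb_version), _parse(cuda_version)
    if bnb >= (0, 41, 0):
        return cuda >= (11, 7)   # 11.7, 11.8, 12.0, 12.1+
    if (0, 37) <= bnb[:2] <= (0, 40):
        return cuda[0] == 11     # CUDA 11.x only
    return None


print(bnb_supports_cuda("0.43.1", "12.1"))
print(bnb_supports_cuda("0.39.0", "12.1"))
```

Pass torch.version.cuda as the second argument once you have confirmed it is not None, since a CPU-only PyTorch build reports None there.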