Fix PyTorch RuntimeError: "Expected all tensors to be on the same device" in HuggingFace

Beginner | 🧠 AI Tools | 2026-05-25 | Python 3.8+, PyTorch 1.10+, HuggingFace Transformers, CUDA-enabled GPU

Error Message

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

#huggingface #pytorch #cuda #device-mismatch #transformers

Error Scenario

It’s a classic PyTorch moment: you’ve successfully loaded a heavy model like Llama-3 or Whisper onto your GPU, but the second you start inference, everything crashes. You likely moved the model using `model.to('cuda')`, yet the tensors returned by your tokenizer are still sitting in system RAM.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Why This Happens

The rule is strict: PyTorch cannot perform a mathematical operation, such as the matrix multiplication named in the error (`mat2` is its second operand), when the operands live on different hardware. Your model weights are on `cuda:0` (the GPU), but your input IDs are on the `cpu`.

This disconnect occurs because `model.to('cuda')` only migrates the model's parameters and buffers. The tokenizer, when called with `return_tensors="pt"`, creates fresh tensors on the CPU by default. PyTorch never copies tensors between devices implicitly, so when a CUDA kernel receives an operand that lives in host memory, it raises this error instead of silently transferring the data.
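The mismatch can be reproduced without downloading a model. The sketch below uses a small `nn.Linear` layer as a stand-in for the model; the layer and the helper name `reproduce_mismatch` are assumptions for illustration only. On a machine without CUDA, both tensors stay on the CPU and no error is raised.

```python
import torch
from torch import nn


def reproduce_mismatch():
    """Return the RuntimeError text, or None when no mismatch occurs."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # .to(device) moves the layer's parameters and buffers only.
    layer = nn.Linear(4, 2).to(device)

    # A freshly created tensor lands on the CPU unless told otherwise,
    # just like the tensors a tokenizer returns.
    x = torch.randn(1, 4)

    try:
        layer(x)
    except RuntimeError as err:
        return str(err)
    return None


message = reproduce_mismatch()
print(message or "No mismatch: layer and input share a device.")
```

On a GPU machine the printed message should name both devices, which is the same failure the model produces during inference.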

The Immediate Solution: Manual Migration

To get your code running, move your entire input dictionary to the same device as the model. Since HuggingFace tokenizers return a dictionary-like object (`BatchEncoding`), you can use a concise dictionary comprehension to move every tensor in one pass. `BatchEncoding` also provides its own `.to(device)` method, so `inputs = inputs.to(device)` achieves the same result.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Detect hardware
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize model and move it to the GPU
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Why is my tensor on the wrong device?", return_tensors="pt")

# FIX: Sync inputs with the model's device
inputs = {k: v.to(device) for k, v in inputs.items()}

output = model.generate(**inputs)
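The dictionary comprehension above assumes every value is a tensor. If a batch also carries non-tensor values (for example raw strings or Python lists from a custom collate function), calling `.to()` on them would fail. A minimal sketch of a more tolerant helper follows; `move_to_device` and the sample batch are hypothetical names and data, not part of the Transformers API.

```python
import torch


def move_to_device(batch, device):
    """Return a plain dict with every tensor value moved to `device`.

    Non-tensor values (lists, strings, None) are passed through unchanged.
    """
    return {
        key: value.to(device) if isinstance(value, torch.Tensor) else value
        for key, value in batch.items()
    }


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for tokenizer output; the "text" entry is a hypothetical extra field.
batch = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "attention_mask": torch.ones(1, 3, dtype=torch.long),
    "text": "raw string kept alongside the tensors",
}

moved = move_to_device(batch, device)
print({k: getattr(v, "device", type(v).__name__) for k, v in moved.items()})
```

The helper returns a new dictionary and leaves the original batch untouched, so it can be dropped in wherever the comprehension is used.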

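Long-term Best Practices

### 1. Leverage device_map="auto"

For modern LLMs, pass `device_map="auto"`, which relies on the `accelerate` library. HuggingFace then places the model's layers across the available devices and attaches hooks that route each layer's inputs to the device where that layer runs. It is still good practice to move your inputs to `model.device` before calling `generate()`.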

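# Requires: pip install accelerate
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate decide where each layer is placed
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)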

### 2. Creating On-the-Fly Tensors

If your custom forward pass generates new tensors (like attention masks or position increments), don't let them default to the CPU. Instead, pass the model's existing device to the constructor so the new tensor is created where the weights already live.

# AVOID: Defaults to CPU and causes crashes
mask = torch.ones((1, 10))

# PREFERRED: Inherits the model's current device automatically
mask = torch.ones((1, 10), device=model.device)
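Inside a plain `nn.Module`, there is no built-in `device` attribute (HuggingFace models add one), so a common approach is to take the device from the input tensor itself. The sketch below assumes a hypothetical layer named `MaskedProjection`; it also registers a constant as a buffer so that it follows the module when `.to(device)` is called.

```python
import torch
from torch import nn


class MaskedProjection(nn.Module):
    """Hypothetical layer that builds a mask inside forward()."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Buffers travel with the module when .to(device) is called.
        self.register_buffer("scale", torch.tensor(0.5))

    def forward(self, x):
        # Create the mask on the same device (and dtype) as the input.
        mask = torch.ones((*x.shape[:-1], 1), device=x.device, dtype=x.dtype)
        return self.proj(x) * mask * self.scale


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = MaskedProjection(8).to(device)
x = torch.randn(2, 5, 8, device=device)
out = layer(x)
print(out.shape, out.device)
```

Because nothing in `forward()` names a device explicitly, the same layer runs unchanged on CPU or GPU.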

### 3. The Pipeline Shortcut

If you prefer simplicity, use the `pipeline` API. Passing `device=0` ensures that both the model and any data passed to it are handled on the GPU automatically, removing the need for manual `.to()` calls.

from transformers import pipeline

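# device=0 maps to cuda:0. The SST-2 checkpoint is used because the plain
# distilbert-base-uncased model has no fine-tuned classification head.
pipe = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)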

# Inputs are moved to the GPU behind the scenes
pipe("This fix worked perfectly!")

Quick Verification

Verify your setup by checking a single parameter from the model and comparing it against your input tensors. This two-line check saves hours of debugging:

print(f"Model location: {next(model.parameters()).device}")
print(f"Input location: {inputs['input_ids'].device}")

If one says `cuda:0` and the other says `cpu`, you still have a mismatch. Both must report the same device for the code to execute.
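The same check can be wrapped in a small helper that reports every offending input at once. This is a sketch under stated assumptions: `find_device_mismatches` is a hypothetical name, and the `nn.Linear` layer and zero tensor stand in for a real model and tokenizer output. It compares against the first parameter's device only, so it may not be meaningful for a model that `device_map="auto"` has spread over several devices.

```python
import torch
from torch import nn


def find_device_mismatches(model, inputs):
    """Map input name -> device for tensors not on the model's device."""
    model_device = next(model.parameters()).device
    return {
        name: tensor.device
        for name, tensor in inputs.items()
        if isinstance(tensor, torch.Tensor) and tensor.device != model_device
    }


model = nn.Linear(4, 2)                      # stand-in for a real model
inputs = {"input_ids": torch.zeros(1, 4)}    # stand-in for tokenizer output

mismatches = find_device_mismatches(model, inputs)
if mismatches:
    print(f"Move these to {next(model.parameters()).device}: {mismatches}")
else:
    print("All inputs share the model's device.")
```

An empty result means every tensor input already sits on the model's device.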
