The Error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x?? in position ??: invalid start byte

Full example when reading a file:

Traceback (most recent call last):
  File "read.py", line 3, in <module>
    content = f.read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 14: invalid start byte

Root Cause

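Python is decoding the file as UTF-8, either because you asked for it or because UTF-8 is the default encoding in your environment. If the file was saved in a different encoding, the decoder stops at the first byte sequence that is not valid UTF-8. Byte 0xb2 in the example above is the character ² in Latin-1, but in UTF-8 it can only appear as a continuation byte, never at the start of a character. The culprit is usually one of these: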

CSV/text files saved with Windows-1252 (CP1252) or Latin-1 (ISO-8859-1) — extremely common with Excel exports
Files created on Windows that default to ANSI encoding
Binary files accidentally opened in text mode
HTTP responses where the server sends a different charset than declared in headers
Database dumps from MySQL with latin1 charset
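
The failure is easy to reproduce without a file. The sketch below uses an illustrative string (not the data behind the traceback above), encodes it as Latin-1, and then tries to decode it as UTF-8. The exception object carries the offending byte and its position, which is the same information the error message prints:

```python
# Illustrative data: "²" becomes the single byte 0xb2 in Latin-1.
raw = "Area: 25 m²".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]   # integer value of the offending byte
    position = exc.start               # offset of that byte in the data
    reason = exc.reason
    print(f"byte {bad_byte:#04x} at position {position}: {reason}")

# The same bytes decode cleanly with the encoding they were written in.
text = raw.decode("latin-1")
print(text)
```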

Fix 1: Detect the Actual Encoding First

Don't guess — use chardet to detect it. It's not always 100% accurate, but it's right often enough to save a lot of trial and error:

pip install chardet

import chardet

with open('data.csv', 'rb') as f:
    raw = f.read()
    result = chardet.detect(raw)
    print(result)  # {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Then pass that encoding directly to open():

with open('data.csv', encoding='windows-1252') as f:
    content = f.read()
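
Because the detector reports a confidence score, it may be worth refusing low-confidence guesses instead of passing them straight to open(). A minimal sketch, assuming a result dict shaped like the chardet output above; the helper name `choose_encoding` and the 0.8 threshold are hypothetical choices, not part of chardet:

```python
def choose_encoding(result, min_confidence=0.8, fallback="utf-8"):
    """Return the detected encoding if it looks trustworthy, else the fallback.

    `result` is a dict shaped like the chardet.detect() output shown above.
    The 0.8 threshold is an arbitrary assumption; tune it for your data.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback

detected = {"encoding": "Windows-1252", "confidence": 0.73, "language": ""}
print(choose_encoding(detected))                      # below the default threshold
print(choose_encoding(detected, min_confidence=0.5))  # accepted with a looser threshold
```

When the fallback is used, treat that as a signal to inspect the file by hand rather than as a confirmed answer.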

Fix 2: Try Common Encodings Manually

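No chardet? Try a short list of common encodings in order. Order matters: latin-1 accepts every byte and never fails, so it has to go last, and windows-1252 rejects only five byte values, so it usually succeeds before the entries after it are reached. A successful decode only means no invalid bytes were found, so check the resulting text by eye:

encodings = ['utf-8', 'windows-1252', 'utf-16', 'cp1250', 'latin-1']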

for enc in encodings:
    try:
        with open('data.txt', encoding=enc) as f:
            content = f.read()
        print(f'Works with: {enc}')
        break
    except UnicodeDecodeError:
        continue
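
Trial decoding cannot reliably tell UTF-16 from a single-byte encoding, but files that start with a byte order mark (BOM) identify themselves. A minimal sketch using the BOM constants from the standard `codecs` module; the helper name `sniff_bom` is hypothetical:

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

sample = "hi".encode("utf-16")   # Python writes a BOM for the 'utf-16' codec
print(sniff_bom(sample))
print(sniff_bom(b"plain ascii"))
```

The returned names are codecs that consume the BOM while decoding, so the decoded text does not start with a stray U+FEFF character.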

Fix 3: Use `errors` Parameter as a Fallback

Sometimes you can't change the encoding — the file comes from a legacy system, a vendor, or an upload you don't control. In those cases, tell Python what to do with the bad bytes instead of crashing:

# Option A: Skip bad bytes entirely
with open('data.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read()

# Option B: Replace bad bytes with � (Unicode replacement character)
with open('data.txt', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Option C: Use backslashreplace for debugging — shows exactly which bytes failed
with open('data.txt', encoding='utf-8', errors='backslashreplace') as f:
    content = f.read()

Warning: errors='ignore' silently drops characters. Use it only when data loss is acceptable — names, addresses, and other freeform text fields rarely survive it cleanly.
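
To see what each handler does to the text, it helps to run all three on the same bytes. The sample below is illustrative: "café au lait" as written by a Windows-1252 system, decoded as UTF-8:

```python
raw = b"caf\xe9 au lait"  # "café au lait" in Windows-1252, invalid as UTF-8

ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(ignored))    # the é is gone without a trace
print(repr(replaced))   # the é became U+FFFD
print(repr(escaped))    # the original byte value is still readable
```

Only backslashreplace keeps enough information to recover the original byte later.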

Fix 4: Read as Binary, Decode Manually

Reach for this when chardet guesses wrong, or when you need to inspect the raw bytes before deciding what to do with them:

with open('data.txt', 'rb') as f:
    raw_bytes = f.read()

# Decode with detected or known encoding
content = raw_bytes.decode('windows-1252')

# Or re-encode to UTF-8 bytes for downstream processing
utf8_bytes = raw_bytes.decode('windows-1252').encode('utf-8')
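
If the goal is to normalize a legacy file once and read it as UTF-8 from then on, the same decode step can be wrapped in a small converter. A sketch under stated assumptions: the helper name `convert_to_utf8`, the file names, and the sample contents are all illustrative, and the source encoding must already be known:

```python
import tempfile
from pathlib import Path

def convert_to_utf8(src, dst, source_encoding):
    """Rewrite src as UTF-8 at dst; raises UnicodeDecodeError on a bad guess."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))

# Demo with a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy.txt"
    dst = Path(tmp) / "legacy_utf8.txt"
    src.write_bytes(b"caf\xe9")                # Windows-1252 bytes
    convert_to_utf8(src, dst, "windows-1252")
    converted = dst.read_bytes()

print(converted)
```

Working on bytes at both ends avoids newline translation, so only the encoding changes.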

Fix 5: HTTP Responses

APIs and scraped pages are sneaky — the Content-Type header might say UTF-8 while the actual body is Latin-1. The requests library gives you a few escape hatches:

import requests

response = requests.get('https://example.com/data')

# Option A: Let requests sniff the encoding from the body (uses charset_normalizer or chardet internally, depending on the version)
response.encoding = response.apparent_encoding
text = response.text

# Option B: Force a known encoding
response.encoding = 'windows-1252'
text = response.text

# Option C: Work with raw bytes and decode yourself
text = response.content.decode('latin-1')
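
The same idea works without requests when you only have the raw body and the Content-Type header value: try the declared charset, and fall back when it is missing or wrong. A minimal standard-library sketch; the helper names and the windows-1252 fallback are assumptions chosen for illustration:

```python
from email.message import Message

def declared_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # None when no charset is declared

def decode_body(body, content_type, fallback="windows-1252"):
    """Try the declared charset, then fall back if it is missing or wrong."""
    charset = declared_charset(content_type)
    if charset:
        try:
            return body.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass  # declared charset was wrong, or unknown to Python
    return body.decode(fallback, errors="replace")

# The header claims UTF-8, but the body is really Windows-1252.
text = decode_body(b"caf\xe9", "text/html; charset=UTF-8")
print(text)
```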

Fix 6: pandas read_csv

This is one of the most common places to hit this error. Excel's plain CSV export on Windows typically writes the system ANSI code page (Windows-1252 on Western European systems), not UTF-8, even when the file looks fine in Excel itself:

import pandas as pd

# Specify the encoding directly
df = pd.read_csv('export.csv', encoding='windows-1252')

# Unknown encoding? Replace bad bytes and inspect the damage later (encoding_errors needs pandas 1.3+)
df = pd.read_csv('export.csv', encoding_errors='replace')

# latin-1 accepts all 256 byte values — it never raises UnicodeDecodeError
df = pd.read_csv('export.csv', encoding='latin-1')

Fix 7: Python 2 → Python 3 Migration

Old Python 2 code treated strings as raw bytes. Moving to Python 3, you must be explicit. Use io.open() for code that needs to run on both versions:

import io

with io.open('legacy.txt', encoding='utf-8') as f:
    content = f.read()
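
The distinction that Python 3 enforces can be seen in a few lines. This sketch uses an illustrative string to show that bytes and str are separate types with different lengths, and that mixing them fails loudly:

```python
data = "café".encode("utf-8")   # bytes, as read from disk or a socket
text = data.decode("utf-8")     # str, a sequence of Unicode characters

print(type(data).__name__, len(data))   # length counts bytes
print(type(text).__name__, len(text))   # length counts characters

# Mixing the two is a TypeError in Python 3 rather than a silent coercion.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)
```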

Verify the Fix

Before shipping to production, run a quick sanity check. Garbage output from a wrong encoding can be subtle — it won't always crash:

with open('data.txt', encoding='windows-1252') as f:
    content = f.read()

# repr() exposes stray control characters such as \x93, and mojibake such as Ã©, if the encoding is still wrong
print(repr(content[:200]))
print(len(content))  # Number of characters, not bytes; compare against a known-good value

# Assert known content is present
assert 'expected_string' in content, 'Encoding mismatch — got garbage'
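
When there is no known string to assert on, a rough heuristic can still flag the usual signs of a wrong decode. A sketch only: the function name `looks_suspicious` and the marker list are assumptions, and legitimate text that contains "Ã" (some Portuguese words in capitals, for example) would be flagged too:

```python
def looks_suspicious(text):
    """Heuristic check for signs of a wrong decode. Not exhaustive."""
    if "\ufffd" in text:                             # replacement character
        return True
    if any("\x80" <= ch <= "\x9f" for ch in text):   # C1 control characters
        return True
    # Typical leftovers of UTF-8 bytes decoded as Windows-1252 or Latin-1.
    return any(marker in text for marker in ("Ã", "â€"))

good = "café"
mojibake = "café".encode("utf-8").decode("windows-1252")

print(looks_suspicious(good))
print(looks_suspicious(mojibake))
```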

Prevention

Build these habits and you'll rarely see this error again:

Always specify encoding when opening files. Never rely on the system default: it is typically CP1252 on Western European Windows, UTF-8 on macOS, and ASCII on some minimal Linux servers, so your code breaks the moment it moves environments.

Bad

with open('file.txt') as f: ...

Good

with open('file.txt', encoding='utf-8') as f: ...
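
To check what a given machine would use when the encoding argument is left out, the standard library can report it. A small sketch; the variable names are illustrative:

```python
import locale
import sys

# Encoding that open() uses when no encoding argument is given.
# With UTF-8 mode enabled (for example `python -X utf8`), this reports UTF-8.
default_file_encoding = locale.getpreferredencoding(False)

# Default for str.encode() and bytes.decode(); this is UTF-8 on Python 3.
default_str_encoding = sys.getdefaultencoding()

print(default_file_encoding, default_str_encoding)
```

Running this on a development laptop and on the production server is a quick way to spot a mismatch before it turns into a decode error.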

  
Save files as UTF-8 from the start. In VS Code, the encoding appears in the bottom-right status bar. Click it to change. One minute now saves an hour of debugging later.

Validate at system boundaries. Files arriving via upload, SFTP, or third-party API should be encoding-checked on arrival, not halfway through a batch job at 2am.

Set `PYTHONIOENCODING` for piped output. Scripts that write to stdout can still hit encoding issues when piped to another process:

PYTHONIOENCODING=utf-8 python myscript.py

Quick Encoding Cheat Sheet

utf-8 — the universal standard; use this wherever you control the data
windows-1252 / cp1252 — Windows Western European; the default for Excel CSV exports
latin-1 / iso-8859-1 — maps all 256 byte values; never raises UnicodeDecodeError, useful as a last resort
utf-16 — Windows Notepad's "Unicode" save format; has a BOM header at the start
shift_jis / cp932 — Japanese text from Windows systems
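
A few of these differences are easy to confirm in an interpreter. The sketch below uses illustrative sample bytes to show where windows-1252 and latin-1 disagree, and what the UTF-16 BOM looks like:

```python
euro = b"\x80"

as_cp1252 = euro.decode("windows-1252")   # the euro sign
as_latin1 = euro.decode("latin-1")        # an invisible C1 control character

utf16 = "A".encode("utf-16")              # two BOM bytes, then the character
shift_jis = "日本".encode("shift_jis")     # two bytes per character here

print(repr(as_cp1252), repr(as_latin1))
print(utf16[:2] in (b"\xff\xfe", b"\xfe\xff"))
print(len(shift_jis))
```

This is why latin-1 is only a last resort: it always succeeds, but bytes in the 0x80 to 0x9f range come out as control characters instead of the punctuation a Windows-1252 author intended.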

When your terminal is mangling output with its own encoding, inspecting the raw bytes in a browser can help. The Base64 Encoder/Decoder at toolcraft.app lets you paste raw bytes and see exactly what's in them — handy for those cases where repr() alone isn't enough.
