The Error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x?? in position ??: invalid start byte
Full example when reading a file:
Traceback (most recent call last):
  File "read.py", line 3, in <module>
    content = f.read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 14: invalid start byte
Root Cause
Python is decoding the file as UTF-8, which is the default on most systems, but the file was saved in a different encoding. The decoder fails at the first byte sequence that is not valid UTF-8: byte 0xb2 in the example above is a valid Latin-1 character (²) but can never start a UTF-8 sequence. The usual culprits are:
- CSV/text files saved with Windows-1252 (CP1252) or Latin-1 (ISO-8859-1), which is extremely common with Excel exports
- Files created on Windows that default to ANSI encoding
- Binary files accidentally opened in text mode
- HTTP responses where the server sends a different charset than declared in headers
- Database dumps from MySQL with the latin1 charset
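To see why the decoder objects, the failure can be reproduced in memory. This is a minimal sketch using a made-up byte string (raw is illustrative, not taken from any real file) that contains 0xb2:

```python
# Hypothetical sample: "m²" as written by a Latin-1 / Windows-1252 editor.
raw = b"Area: 25 m\xb2"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte; exc.reason explains why.
    failure = (exc.start, hex(raw[exc.start]), exc.reason)
    print(failure)

# The same bytes decode cleanly as Latin-1, where 0xb2 is SUPERSCRIPT TWO.
decoded = raw.decode("latin-1")
print(decoded)
```

The position reported in the traceback is a byte offset, so it can be used to index straight into the raw bytes.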
Fix 1: Detect the Actual Encoding First
Don't guess: use chardet to detect it. It's not always 100% accurate, but it's right often enough to save a lot of trial and error:
pip install chardet
import chardet
with open('data.csv', 'rb') as f:
    raw = f.read()
result = chardet.detect(raw)
print(result)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
Then pass that encoding directly to open():
with open('data.csv', encoding='windows-1252') as f:
    content = f.read()
Fix 2: Try Common Encodings Manually
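No chardet? These five encodings cover most real-world files. Try them in order, and keep latin-1 last: it accepts every byte, so anything listed after it is never tried. A clean decode only proves the bytes are valid in that encoding, so still check the output for garbled characters:
encodings = ['utf-8', 'windows-1252', 'utf-16', 'cp1250', 'latin-1']
for enc in encodings:
    try:
        with open('data.txt', encoding=enc) as f:
            content = f.read()
        print(f'Works with: {enc}')
        break
    except UnicodeDecodeError:
        continue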
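The loop above can be wrapped in a reusable helper. This is a sketch rather than a robust detector: the name read_with_fallback and the candidate order are assumptions, and latin-1 is deliberately placed last because it accepts every byte and would otherwise mask the other candidates.

```python
from pathlib import Path

# Assumed candidate order; latin-1 goes last because it never fails.
CANDIDATES = ("utf-8", "windows-1252", "latin-1")


def read_with_fallback(path, encodings=CANDIDATES):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    # Read bytes once and decode in memory instead of reopening the file.
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")
```

Returning the encoding alongside the text makes it easy to log which candidate matched. Unlike open() in text mode, decoding the bytes directly leaves line endings untouched.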
Fix 3: Use errors Parameter as a Fallback
Sometimes you can't change the encoding β the file comes from a legacy system, a vendor, or an upload you don't control. In those cases, tell Python what to do with the bad bytes instead of crashing:
# Option A: Skip bad bytes entirely
with open('data.txt', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Option B: Replace bad bytes with οΏ½ (Unicode replacement character)
with open('data.txt', encoding='utf-8', errors='replace') as f:
content = f.read()
# Option C: Use backslashreplace for debugging β shows exactly which bytes failed
with open('data.txt', encoding='utf-8', errors='backslashreplace') as f:
content = f.read()
Warning: errors='ignore' silently drops characters. Use it only when data loss is acceptable β names, addresses, and other freeform text fields rarely survive it cleanly.
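The difference between the three handlers is easiest to see side by side. The sketch below applies them to a made-up byte string (raw is a hypothetical sample, not real data) in which é was saved as the single Windows-1252 byte 0xe9:

```python
# Hypothetical sample: "café au lait" saved as Windows-1252.
raw = b"caf\xe9 au lait"

ignored = raw.decode("utf-8", errors="ignore")            # the é is dropped
replaced = raw.decode("utf-8", errors="replace")          # the é becomes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # the é becomes the text \xe9

for label, value in (("ignore", ignored), ("replace", replaced),
                     ("backslashreplace", escaped)):
    print(f"{label:>16}: {value}")
```

Only backslashreplace keeps enough information to work out what the original byte was.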
Fix 4: Read as Binary, Decode Manually
Reach for this when chardet guesses wrong, or when you need to inspect the raw bytes before deciding what to do with them:
with open('data.txt', 'rb') as f:
    raw_bytes = f.read()
# Decode with detected or known encoding
content = raw_bytes.decode('windows-1252')
# Or re-encode to UTF-8 bytes for downstream processing
content_utf8 = content.encode('utf-8')
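When inspecting raw bytes, it helps to list where the non-ASCII bytes sit. The helper below is an illustrative sketch (non_ascii_bytes and the sample bytes are made up for this example):

```python
def non_ascii_bytes(raw, limit=10):
    """Return up to `limit` (offset, hex value) pairs for bytes >= 0x80."""
    hits = []
    for offset, byte in enumerate(raw):
        if byte >= 0x80:
            hits.append((offset, f"0x{byte:02x}"))
            if len(hits) == limit:
                break
    return hits


sample = b"na\xefve caf\xe9"  # hypothetical Windows-1252 bytes
print(non_ascii_bytes(sample))
```

Isolated high bytes like these point to a single-byte encoding. In valid UTF-8, every non-ASCII character is a run of two or more bytes that are all 0x80 or higher.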
Fix 5: HTTP Responses
APIs and scraped pages are sneaky: the Content-Type header might say UTF-8 while the actual body is Latin-1. The requests library gives you a few escape hatches:
import requests
response = requests.get('https://example.com/data')
# Option A: Let requests sniff the encoding from the body (uses charset_normalizer or chardet internally)
response.encoding = response.apparent_encoding
text = response.text
# Option B: Force a known encoding
response.encoding = 'windows-1252'
text = response.text
# Option C: Work with raw bytes and decode yourself
text = response.content.decode('latin-1')
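Option C can be made a little more defensive by trying the declared charset first and falling back only if it fails. This is a standard-library sketch under stated assumptions: decode_body and its fallback order are hypothetical, and the inputs stand in for response.content and the Content-Type header value.

```python
from email.message import Message


def decode_body(body, content_type, fallbacks=("utf-8", "latin-1")):
    """Decode HTTP body bytes, preferring the declared charset if it works."""
    header = Message()
    header["Content-Type"] = content_type
    declared = header.get_content_charset()  # None when no charset is given

    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return body.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not know.
            continue
    raise ValueError("no candidate encoding could decode the body")


# The header claims UTF-8, but the body is really Latin-1.
text, used = decode_body(b"caf\xe9", "text/html; charset=utf-8")
print(used, text)
```

Because latin-1 is the last fallback, the function always returns something. Treat a latin-1 result as a signal to inspect the text, not as proof the encoding is right.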
Fix 6: pandas read_csv
This is the #1 place people hit this error. Excel exports to CSV on Windows (Western locales) almost always use Windows-1252, not UTF-8, even when the file looks fine in Excel itself:
import pandas as pd
# Specify the encoding directly
df = pd.read_csv('export.csv', encoding='windows-1252')
# Unknown encoding? Replace bad bytes and inspect the damage later (encoding_errors needs pandas 1.3+)
df = pd.read_csv('export.csv', encoding_errors='replace')
# latin-1 accepts all 256 byte values, so it never raises UnicodeDecodeError (but may show the wrong characters)
df = pd.read_csv('export.csv', encoding='latin-1')
Fix 7: Python 2 → Python 3 Migration
Old Python 2 code treated strings as raw bytes. Moving to Python 3, you must be explicit. Use io.open() for code that needs to run on both versions:
import io
with io.open('legacy.txt', encoding='utf-8') as f:
    content = f.read()
Verify the Fix
Before shipping to production, run a quick sanity check. Garbage output from a wrong encoding can be subtle, and it won't always crash:
with open('data.txt', encoding='windows-1252') as f:
    content = f.read()
# repr() exposes \x?? escapes or unexpected symbols if something is still wrong
print(repr(content[:200]))
print(len(content))  # Character count; matches the byte size only for single-byte encodings
# Assert known content is present
assert 'expected_string' in content, 'Encoding mismatch: got garbage'
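One kind of subtle garbage is UTF-8 text that was decoded as Windows-1252, which produces telltale character pairs instead of an exception. The check below is a rough heuristic sketch. The marker list and the name looks_like_mojibake are assumptions, and legitimate text that contains these characters (for example, uppercase Portuguese) will trigger false positives.

```python
# Sequences that commonly appear when UTF-8 bytes are read as Windows-1252.
MOJIBAKE_MARKERS = ("Ã", "â€", "Â")


def looks_like_mojibake(text):
    """Heuristic: True if text contains typical UTF-8-as-CP1252 debris."""
    return any(marker in text for marker in MOJIBAKE_MARKERS)


good = "café"
# Simulate the mistake: encode as UTF-8, then decode with the wrong codec.
bad = good.encode("utf-8").decode("windows-1252")

print(looks_like_mojibake(good), looks_like_mojibake(bad))
```

A check like this fits next to the assertion above as a warning, not as a hard failure.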
Prevention
Build these habits and you'll rarely see this error again:
- **Always specify encoding when opening files.** Never rely on the system default. On Windows it's usually the ANSI code page (CP1252 in Western locales), on macOS it's UTF-8, and on some Linux servers it's ASCII, so your code breaks the moment it moves environments.
  # Bad
  with open('file.txt') as f: ...
  # Good
  with open('file.txt', encoding='utf-8') as f: ...
- **Save files as UTF-8 from the start.** In VS Code, the encoding appears in the bottom-right status bar. Click it to change. One minute now saves an hour of debugging later.
- **Validate at system boundaries.** Files arriving via upload, SFTP, or third-party API should be encoding-checked on arrival, not halfway through a batch job at 2am.
- **Set `PYTHONIOENCODING` for piped output.** Scripts that write to stdout can still hit encoding issues when piped to another process:
```
PYTHONIOENCODING=utf-8 python myscript.py
```
Quick Encoding Cheat Sheet
- utf-8: the universal standard; use this wherever you control the data
- windows-1252 / cp1252: Windows Western European; the default for Excel CSV exports
- latin-1 / iso-8859-1: maps all 256 byte values; never raises UnicodeDecodeError, useful as a last resort
- utf-16: Windows Notepad's "Unicode" save format; has a BOM header at the start
- shift_jis / cp932: Japanese text from Windows systems
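Since some of these encodings announce themselves with a BOM, checking the first few bytes is a cheap first test before trying anything else. This is a minimal sketch (sniff_bom is a hypothetical helper) that uses the BOM constants from the standard codecs module:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


data = "hi".encode("utf-16")  # Python writes a BOM for the plain utf-16 codec
print(sniff_bom(data), data.decode(sniff_bom(data)))
```

The returned codec names (utf-8-sig, utf-16, utf-32) consume the BOM while decoding, so it does not leak into the text. A None result just means no BOM was found, which is normal for UTF-8 and for all single-byte encodings.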
When your terminal is mangling output with its own encoding, inspecting the raw bytes in a browser can help. The Base64 Encoder/Decoder at toolcraft.app lets you paste raw bytes and see exactly what's in them, which is handy for those cases where repr() alone isn't enough.

