Fix Python UnicodeDecodeError: 'charmap' codec can't decode byte on Windows

The Error

You're reading a text file in Python on Windows and suddenly hit this:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10859: character maps to <undefined>

The file opens fine in Notepad or VS Code, but Python refuses. The reason comes down to one thing: encoding defaults.

Root Cause

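Python's open() on Windows defaults to the system locale encoding, usually cp1252 (Windows-1252) on US and Western European installs. The error message says 'charmap' because Python implements cp1252 as a character-map codec. cp1252 is a single-byte encoding: it covers Western European Latin characters and leaves five byte values (0x81, 0x8d, 0x8f, 0x90, 0x9d) unassigned.

When your file contains one of those unassigned bytes (which happens with UTF-8 text, files from a different codepage, or anything with emoji or extended Unicode), Python raises UnicodeDecodeError. Byte 0x9d is a classic example: it is the last byte of the UTF-8 encoding of the right curly quote ” (U+201D, bytes E2 80 9D), but it is undefined in cp1252.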
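To see the mismatch concretely, the minimal sketch below encodes a short string containing curly quotes to UTF-8 and then decodes those bytes as cp1252, which is what a bare open() does on a typical Western Windows install. The sample string and the helper name try_decode are illustrative, not part of any library.

```python
def try_decode(raw: bytes, encoding: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Illustrative sample: curly quotes are common in text pasted from Word or the web.
sample = "She said \u201chello\u201d"
raw = sample.encode("utf-8")  # U+201D becomes the bytes E2 80 9D

# Decoding as cp1252 fails on the byte that cp1252 leaves unassigned.
text, err = try_decode(raw, "cp1252")
if err is not None:
    print(f"cp1252 failed on byte {raw[err.start]:#04x} at position {err.start}")

# Decoding the same bytes as UTF-8 succeeds and returns the original string.
text_utf8, err_utf8 = try_decode(raw, "utf-8")
print("utf-8 round trip ok:", text_utf8 == sample)
```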

On macOS and Linux this rarely happens. Their default is already UTF-8. Windows is the outlier.

Fix 1: Specify UTF-8 Encoding (Fixes Most Cases)

Just add encoding='utf-8' to your open() call:

# Before (breaks on Windows)
with open('data.txt') as f:
    content = f.read()

# After (works everywhere)
with open('data.txt', encoding='utf-8') as f:
    content = f.read()

If the file was saved as UTF-8 — which most modern tools do by default — this is all you need. One line change.

Verify it worked

with open('data.txt', encoding='utf-8') as f:
    content = f.read()
print(f"Read {len(content)} characters successfully")

No exception. Fixed.
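The same locale default applies when writing files and when using pathlib, so it can help to pass the encoding on both sides. A minimal round-trip sketch, using a temporary directory and a made-up file name so it does not touch real data:

```python
import tempfile
from pathlib import Path

# Illustrative text with characters outside ASCII.
sample = "caf\u00e9, na\u00efve, \u201cquotes\u201d"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.txt"  # hypothetical file name

    # Write with an explicit encoding so the bytes on disk are UTF-8 on every OS.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # pathlib's read_text() accepts the same encoding argument as open().
    round_tripped = path.read_text(encoding="utf-8")

print("round trip ok:", round_tripped == sample)
```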

Fix 2: When You Don't Know the Encoding

Getting files from a client or third-party system? The encoding could be anything — cp1252, latin-1, shift-jis, GBK. Use chardet to detect it automatically.

pip install chardet

import chardet

# Read raw bytes first
with open('data.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)
detected_encoding = result['encoding']
confidence = result['confidence']
print(f"Detected: {detected_encoding} (confidence: {confidence:.0%})")

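# chardet returns None for the encoding when it cannot make a guess
if detected_encoding is None:
    raise ValueError("chardet could not detect an encoding for data.txt")

# Now read with the detected encoding
with open('data.txt', encoding=detected_encoding) as f:
    content = f.read()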

Confidence below 70%? The file might be corrupted, mixed-encoding, or from a very obscure codepage. Check with the sender before doing anything with the output.
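One way to act on the confidence score is to gate on it before decoding. The sketch below assumes a result dictionary shaped like the one chardet.detect() returns (keys 'encoding' and 'confidence'). The helper name choose_encoding and the 0.7 threshold are illustrative choices, not part of chardet.

```python
from typing import Optional


def choose_encoding(result: dict, min_confidence: float = 0.7) -> Optional[str]:
    """Return the detected encoding, or None if the guess is missing or weak."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


# Hand-written results shaped like chardet.detect() output, for illustration only.
strong = {"encoding": "utf-8", "confidence": 0.99}
weak = {"encoding": "Windows-1252", "confidence": 0.42}
unknown = {"encoding": None, "confidence": 0.0}

for result in (strong, weak, unknown):
    print(result, "->", choose_encoding(result))
```

A None return is the signal to stop and check with the sender rather than decode on a weak guess.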

Fix 3: Skip or Replace Bad Bytes

Need to read the file fast and don't care about a few unreadable characters? The errors parameter handles this:

# Replace undecodable bytes with the replacement character (U+FFFD)
with open('data.txt', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Or drop them entirely
with open('data.txt', encoding='utf-8', errors='ignore') as f:
    content = f.read()

Fair warning: errors='ignore' silently drops data. If you're processing customer records or financial exports, you'll never know what went missing. Reserve this for cases where the unreadable characters genuinely don't matter.
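To make the difference between the two handlers visible, here is a small in-memory sketch. The sample bytes are illustrative: they are the cp1252 encoding of "café ok", deliberately decoded as UTF-8.

```python
# cp1252 bytes for "café ok"; the byte 0xE9 is not valid UTF-8 on its own.
raw = "caf\u00e9 ok".encode("cp1252")

replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

print(ascii(replaced))  # 'caf\ufffd ok'
print(ascii(ignored))   # 'caf ok'

# 'replace' leaves a countable marker behind; 'ignore' leaves no trace at all.
damaged = replaced.count("\ufffd")
print("replacement characters:", damaged)
```

Counting U+FFFD after a read with errors='replace' gives a rough measure of how much was lost, which errors='ignore' cannot provide.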

Fix 4: Force UTF-8 for the Whole Process

Running a script with file operations scattered throughout? Set PYTHONUTF8=1 once and every open() call defaults to UTF-8 automatically:

# Command Prompt
set PYTHONUTF8=1
python your_script.py

# PowerShell
$env:PYTHONUTF8 = "1"
python your_script.py

Or pass it inline with the -X flag:

python -X utf8 your_script.py

UTF-8 mode applies to the entire process — file I/O, stdin, stdout, stderr, everything. Available in Python 3.7+.

To make it permanent, add PYTHONUTF8=1 to your Windows environment variables: System Properties → Advanced → Environment Variables → New (under "User variables").
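To confirm the setting took effect, sys.flags.utf8_mode reports whether UTF-8 mode is on. The sketch below checks the current interpreter and then, purely as an illustration, launches a child interpreter with -X utf8 to show the flag switched on.

```python
import subprocess
import sys

# 1 when UTF-8 mode is active in this process, 0 otherwise.
print("this process:", sys.flags.utf8_mode)

# Launch a child interpreter with UTF-8 mode forced on and ask it the same question.
child = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", "import sys; print(sys.flags.utf8_mode)"],
    capture_output=True,
    text=True,
    check=True,
)
child_mode = child.stdout.strip()
print("child with -X utf8:", child_mode)
```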

Fix 5: latin-1 for Old Windows Files

Some legacy files — exports from old Access databases, ancient CRMs, government data portals — are genuinely cp1252 or latin-1. If chardet confirms this, read with the right codec:

with open('legacy_file.txt', encoding='cp1252') as f:
    content = f.read()

# latin-1 maps every byte 0x00-0xFF to a Unicode character
# so it never raises UnicodeDecodeError
with open('mystery_file.txt', encoding='latin-1') as f:
    content = f.read()

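Latin-1 accepts every possible byte value without error, which is what makes it useful as a fallback. The catch: because it never fails, it also never tells you it was the wrong choice. If the file is actually UTF-8, every non-ASCII character comes out garbled (é shows up as Ã©). Use it last, not first.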
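The "use it last" advice can be sketched as an ordered fallback: try strict UTF-8 first, then cp1252, and only then latin-1. The helper name decode_with_fallback and the order of candidates are assumptions for illustration; adjust the list to the encodings you actually expect.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


# Illustrative inputs, built in memory rather than read from disk.
utf8_bytes = "na\u00efve".encode("utf-8")
legacy_bytes = "na\u00efve".encode("cp1252")  # contains 0xEF, which is not valid UTF-8 here
odd_bytes = b"\x81\x9d"                        # invalid UTF-8 and unassigned in cp1252

for raw in (utf8_bytes, legacy_bytes, odd_bytes):
    text, used = decode_with_fallback(raw)
    print(used, ascii(text))
```

UTF-8 goes first because it is strict: legacy single-byte data almost never passes as valid UTF-8 by accident, whereas cp1252 and latin-1 will accept most UTF-8 bytes and return garbled text.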

Checking Your System's Default Encoding

Not sure what Python is defaulting to on your machine? Run this:

import sys
import locale

print(sys.getdefaultencoding())            # always 'utf-8' on Python 3
print(sys.getfilesystemencoding())         # usually 'utf-8', including on Windows since Python 3.6
print(locale.getpreferredencoding(False))  # what open() actually uses; often 'cp1252' on Windows

If locale.getpreferredencoding() returns cp1252, that's the culprit. Every text-mode open() call without an explicit encoding= uses it silently. Python 3.10 and later can warn about this, but only if you opt in by running with -X warn_default_encoding.
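A file object also records the encoding it was opened with, which gives a quick way to see what a bare open() picked on your machine. A small sketch using a temporary file (the file name is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("probe")

    # No encoding argument: Python falls back to the locale default.
    with open(path) as f:
        implicit_encoding = f.encoding

    # Explicit argument: the file object reports exactly what was requested.
    with open(path, encoding="utf-8") as f:
        explicit_encoding = f.encoding

print("implicit:", implicit_encoding)
print("explicit:", explicit_encoding)
```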

Prevention

Always write encoding='utf-8' in every open() call. Do it even on Linux and macOS where UTF-8 is the default — your code will run on Windows someday.
Save files as UTF-8 in your editor. VS Code defaults to UTF-8. Notepad has defaulted to UTF-8 since Windows 10 version 1903; earlier versions defaulted to ANSI (cp1252 on most Western systems).
If you publish a library, accept an encoding parameter in any function that reads files. Document what the default is.
Pylint's W1514 (unspecified-encoding) flags text-mode open() calls that lack an explicit encoding. Wire it into your CI pipeline and you'll catch these before they reach production.
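For the library advice above, a reader function might look like the following sketch. The function name read_lines and its UTF-8 default are illustrative assumptions, not an established API.

```python
import os
import tempfile
from typing import List


def read_lines(path: str, encoding: str = "utf-8") -> List[str]:
    """Read a text file and return its lines without trailing newlines.

    The encoding defaults to UTF-8; callers with legacy files can override it,
    for example encoding="cp1252".
    """
    with open(path, encoding=encoding) as f:
        return f.read().splitlines()


# Usage with a throwaway cp1252 file (the file name is made up for the demo).
with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "w", encoding="cp1252") as f:
        f.write("caf\u00e9\nna\u00efve\n")
    lines = read_lines(legacy_path, encoding="cp1252")

print([ascii(line) for line in lines])
```

Stating the default in both the signature and the docstring means callers never have to guess which encoding the function assumes.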