The Error
You're crunching through a large CSV, loading a dataset into a list, or running heavy computation, then Python just dies:
MemoryError
Sometimes you get a traceback with more context:
Traceback (most recent call last):
  File "process.py", line 12, in <module>
    data = [line for line in open('huge_file.csv').readlines()]
MemoryError
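Python tried to allocate memory the operating system couldn't provide. The process ran out of available memory, whether that was physical RAM, swap, or a limit imposed on the process.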
Why This Happens
The most common cause: loading everything at once. Take a 4 GB CSV. Read it with readlines() and Python doesn't just use 4 GB. Per-object overhead can multiply that by roughly 3-5x, so you're looking at 12-20 GB of RAM just to hold the data.
Other causes:
- Building giant lists or dicts in a loop without releasing references
- NumPy operations that create large intermediate arrays
- Constrained environments: VPS with 2 GB RAM, Docker containers, CI runners
- 32-bit Python hitting its per-process address space limit (typically 2 to 4 GB depending on the OS, no matter how much RAM is installed)
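To get a feel for the object overhead, a rough sketch using only the standard library can compare the raw size of some text with what Python holds after splitting it into a list of str objects. The sample data is made up, and the exact ratio varies by Python version, platform, and line length.

```python
import sys

# Hypothetical sample: 100,000 short CSV-like lines
lines = [f"{i},item{i},{i * 0.5}\n" for i in range(100_000)]

# Bytes the same text would occupy on disk (ASCII, 1 byte per character)
raw_bytes = sum(len(line) for line in lines)

# Bytes Python uses: the list's pointer array plus every str object
in_memory = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

ratio = in_memory / raw_bytes
print(f"raw: {raw_bytes:,} B, in memory: {in_memory:,} B, ratio: {ratio:.1f}x")
```

Short lines show the largest ratio, because the fixed per-string overhead dominates the payload.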
Fix 1: Read Files in Chunks
Python's file objects are lazy by default. Stop fighting that and use it.
# Bad: pulls the entire file into RAM
with open('huge_file.txt') as f:
    lines = f.readlines()  # MemoryError here
# Good: one line at a time, O(1) memory
with open('huge_file.txt') as f:
    for line in f:
        process(line)
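Line-by-line iteration assumes the file has line breaks. For binary files, or text with very long lines, a similar sketch reads fixed-size blocks instead. The names iter_blocks and count_bytes and the 1 MiB default are illustrative choices, not part of any library.

```python
from functools import partial

def iter_blocks(path, block_size=1024 * 1024):
    """Yield successive blocks of at most block_size bytes from a file."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b""
        for block in iter(partial(f.read, block_size), b""):
            yield block

def count_bytes(path, block_size=1024 * 1024):
    """Example consumer: total file size, holding one block at a time."""
    return sum(len(block) for block in iter_blocks(path, block_size))
```

Memory use is bounded by block_size regardless of how large the file is.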
For pandas, the chunksize parameter gives you the same control:
import pandas as pd
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000):
    # chunk is a DataFrame of up to 100k rows, small enough to handle
    result = chunk.groupby('category')['value'].sum()
    save_partial_result(result)
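The per-chunk sums are partial: a category that appears in several chunks gets one partial sum per chunk. One way to merge them is to accumulate into a running Series. In this sketch an in-memory string stands in for huge_file.csv, and a tiny chunksize forces several chunks; the column names follow the example above.

```python
import io
import pandas as pd

# Stand-in for huge_file.csv (assumed columns: category, value)
csv_text = "category,value\na,1\nb,2\na,3\nc,4\nb,5\na,6\n"

total = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial = chunk.groupby("category")["value"].sum()
    # add() with fill_value=0 aligns on category and tolerates new categories
    total = partial if total is None else total.add(partial, fill_value=0)

print(total.sort_index())
```

This works for sums and counts. Aggregations such as a mean need the sum and the count tracked separately and divided at the end.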
Fix 2: Use Generators Instead of Lists
Building a list just to loop through it once is wasteful. A generator computes each value on demand and holds almost nothing in memory.
# Bad: 10 million integers all at once in RAM
squares = [x**2 for x in range(10_000_000)]
# Good: generator expression, computes one value at a time
squares = (x**2 for x in range(10_000_000))
for val in squares:
    process(val)
For file processing, yield turns any function into a generator:
def read_records(filepath):
    with open(filepath) as f:
        for line in f:
            yield parse(line)
for record in read_records('big.log'):
    process(record)
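Generators also compose. When a downstream step works better on groups of records (bulk inserts, for example), a small helper can batch any iterable without materializing all of it. The helper name batched_records is hypothetical; Python 3.12 and later ship a similar itertools.batched.

```python
from itertools import islice

def batched_records(iterable, batch_size):
    """Yield lists of up to batch_size items, pulling from the source lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Small demo; the batches are collected into a list here only to display them
squares = (x**2 for x in range(10))
batches = list(batched_records(squares, 4))
print(batches)
```

In real use you would loop over batched_records(...) directly, so only one batch exists in memory at a time.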
Fix 3: Reduce Memory Usage with NumPy dtypes
NumPy defaults to float64, which takes 8 bytes per element. For 100 million elements, that's 800 MB. Switch to float32 and you cut it to 400 MB. Use uint8 for 0-255 integers and it drops to 100 MB.
import numpy as np
# Default: float64 = 8 bytes/element, ~800 MB for 100M elements
arr = np.array(data)
# float32 = 4 bytes/element, ~400 MB
arr = np.array(data, dtype=np.float32)
# uint8 = 1 byte/element, ~100 MB
arr = np.array(data, dtype=np.uint8)
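Downcasting is only safe when the values fit the smaller type. A quick sketch, using a made-up 1-million-element array, checks the footprint with nbytes and guards the uint8 conversion with a range check:

```python
import numpy as np

# Hypothetical data: 1 million values in the range 0-255
values = np.arange(1_000_000) % 256  # default integer dtype (int64 on most platforms)

as_f64 = values.astype(np.float64)
as_f32 = values.astype(np.float32)

# Only downcast to uint8 after confirming every value fits in 0-255
info = np.iinfo(np.uint8)
if values.min() >= info.min and values.max() <= info.max:
    as_u8 = values.astype(np.uint8)
else:
    raise ValueError("values do not fit in uint8")

for name, arr in [("float64", as_f64), ("float32", as_f32), ("uint8", as_u8)]:
    print(f"{name}: {arr.nbytes / 1e6:.0f} MB")
```

Without the range check, out-of-range values would wrap around silently during the cast.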
The same principle applies to pandas. Declare dtypes upfront instead of letting pandas guess:
df = pd.read_csv('data.csv', dtype={
    'user_id': 'int32',
    'score': 'float32',
    'category': 'category'  # repeated strings: categorical saves a lot
})
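To check what a dtype choice actually saves, pandas can report per-column memory. This sketch builds a small made-up DataFrame and compares the inferred types with explicit ones; memory_usage(deep=True) counts the string contents as well as the column buffers.

```python
import pandas as pd

# Hypothetical data: 100,000 rows, 3 distinct category strings
n = 100_000
df = pd.DataFrame({
    "user_id": range(n),
    "score": [i * 0.5 for i in range(n)],
    "category": ["alpha", "beta", "gamma"] * (n // 3) + ["alpha"] * (n % 3),
})

before = df.memory_usage(deep=True).sum()

# Same columns and dtypes as the read_csv example
slim = df.astype({"user_id": "int32", "score": "float32", "category": "category"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

The categorical column tends to give the biggest saving when the number of distinct strings is small relative to the row count.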
Fix 4: Use Memory-Mapped Files
For large binary files or NumPy arrays, memory mapping hands paging control to the OS. You get array-style access without loading the whole file upfront, because the OS fetches only the pages you actually touch.
import numpy as np
# Doesn't read the whole file, just maps it
arr = np.load('large_array.npy', mmap_mode='r')
# Only the pages backing this slice are read, and only when it is accessed
subset = arr[1000:2000]
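To see the effect end to end, this sketch writes a small array to a temporary .npy file and reopens it memory-mapped. The file name and array size are placeholders; a real use case would involve a file far larger than RAM.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "large_array.npy")
    np.save(path, np.arange(10_000, dtype=np.float32))

    # mmap_mode='r' returns a read-only np.memmap instead of reading the data
    arr = np.load(path, mmap_mode="r")
    is_mapped = isinstance(arr, np.memmap)

    # np.array(...) copies just this slice into ordinary memory
    subset = np.array(arr[1000:2000])
    subset_sum = float(subset.sum())

    # Drop the mapping before the temporary directory is removed
    del arr

print(is_mapped, subset.shape, subset_sum)
```

Copying the slice with np.array detaches it from the file, so the result stays valid after the mapping is gone.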
Fix 5: Process with Dask for Out-of-Core DataFrames
When you need pandas-style operations on data that doesn't fit in RAM, Dask handles it natively. The API is nearly identical, but execution is lazy and chunked:
pip install "dask[dataframe]"
import dask.dataframe as dd
# Builds a lazy computation graph; nothing loads yet
df = dd.read_csv('huge_file.csv')
# Also lazy
result = df.groupby('category')['value'].sum()
# .compute() triggers actual execution, chunk by chunk
print(result.compute())
Fix 6: Delete Objects and Force Garbage Collection
Working with large objects sequentially? Delete them explicitly when you're done. Otherwise the previous batch and result stay alive until their names are reassigned, which means they overlap in memory with the next ones:
import gc
for batch in batches:
    result = process(batch)
    save(result)
    del result
    del batch
    gc.collect()  # force collection if memory is tight
Treat this as a last resort. If you're regularly hitting MemoryError, restructuring around generators or chunks will serve you better long-term.
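In CPython, del usually frees an object right away through reference counting; gc.collect() matters mainly for objects caught in reference cycles. The sketch below illustrates the difference with a hypothetical Batch class. It relies on CPython behavior, and other interpreters may free objects at different times.

```python
import gc
import weakref

class Batch:
    """Hypothetical stand-in for a large object."""
    def __init__(self, n):
        self.data = list(range(n))

gc.disable()  # pause automatic collection so the demo is deterministic

# Case 1: no cycle, so del frees the object immediately
plain = Batch(1000)
plain_ref = weakref.ref(plain)
del plain
freed_by_del = plain_ref() is None

# Case 2: the object references itself, so del alone is not enough
cyclic = Batch(1000)
cyclic.me = cyclic
cyclic_ref = weakref.ref(cyclic)
del cyclic
alive_after_del = cyclic_ref() is not None
gc.collect()
freed_by_gc = cyclic_ref() is None

gc.enable()
print(freed_by_del, alive_after_del, freed_by_gc)
```

If your batches contain no cycles, the del statements do the real work and the gc.collect() call adds little beyond its own cost.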
Fix 7: Add Swap Space (Linux)
On a server where you can't refactor immediately, swap space prevents the crash at the cost of speed. A 4 GB swap file on an SSD might slow things down 10-20x compared to RAM, but it beats a crashed process.
# Check what you currently have
free -h
# Create a 4GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
This buys time. It doesn't fix the root cause.
Verify the Fix
Use memory_profiler to confirm your changes actually reduced peak usage:
pip install memory-profiler
from memory_profiler import profile
@profile
def my_function():
    # your code here
    pass
my_function()
The output shows memory usage line by line. After switching to chunking or generators, peak RAM should drop from gigabytes to tens of megabytes for most workloads.
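If installing memory_profiler isn't an option, the standard library's tracemalloc gives a rough peak figure. This sketch compares a list-based sum with a generator-based one. The helper peak_bytes is hypothetical, and tracemalloc only tracks allocations made through Python's memory allocators, so treat the numbers as relative rather than as total process memory.

```python
import tracemalloc

def peak_bytes(func):
    """Run func() and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def sum_with_list():
    # Materializes all 200,000 squares before summing
    return sum([x**2 for x in range(200_000)])

def sum_with_generator():
    # Produces one square at a time
    return sum(x**2 for x in range(200_000))

list_result, list_peak = peak_bytes(sum_with_list)
gen_result, gen_peak = peak_bytes(sum_with_generator)
print(f"list: {list_peak:,} B, generator: {gen_peak:,} B")
```

Both functions return the same total; only the peak differs.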
Watch live usage in a second terminal while your script runs:
watch -n 1 'free -h'
Prevention
- Profile before scaling: Run memory_profiler on a small sample first. Catching a 10x memory spike at 1k rows is much cheaper than debugging it at 10M rows.
- Default to generators: Any function that produces a sequence should use yield unless you have a concrete reason to materialize the whole list.
- Set explicit dtypes in pandas: Letting pandas infer types defaults numeric columns to int64/float64. On a 50-column dataset, that's often 2-4x more memory than necessary.
- Benchmark chunk sizes: Too small adds I/O overhead; too large spikes memory. For most workloads, 50k-200k rows per chunk is a reasonable starting range. Tune from there.

