Fix Python MemoryError When Processing Large Data

The Error

You're crunching through a large CSV, loading a dataset into a list, or running heavy computation — then Python just dies:

MemoryError

Sometimes you get a traceback with more context:

Traceback (most recent call last):
  File "process.py", line 12, in <module>
    data = [line for line in open('huge_file.csv').readlines()]
MemoryError

Python asked the operating system for more memory and the request was refused. The process hit the limit of available RAM, swap, or address space.

Why This Happens

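The most common cause is loading everything at once. Take a 4 GB CSV. Read it with readlines() and Python needs far more than 4 GB: every line becomes its own str object with a fixed header of several dozen bytes, plus a pointer in the list. For short lines, or for data parsed into one object per field, that overhead commonly multiplies the footprint by 3–5x, so you could be looking at 12–20 GB of RAM just to hold the data.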
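To see where the multiplier comes from, you can compare the raw size of some text with what the equivalent list of strings occupies in memory. This is a rough sketch on made-up sample data (the `lines` content is hypothetical), and the exact ratio depends on line length and Python version:

```python
import sys

# Hypothetical sample: 100,000 short CSV-style lines
lines = [f"{i},item{i},{i * 0.5}\n" for i in range(100_000)]

# ASCII text, so character count equals the byte count on disk
raw_bytes = sum(len(line) for line in lines)

# The list's pointer array plus every individual str object
in_memory = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

ratio = in_memory / raw_bytes
print(f"on disk: {raw_bytes:,} B, in memory: {in_memory:,} B, ratio: {ratio:.1f}x")
```

The shorter the lines, the worse the ratio, because the per-object header stays the same size no matter how little text it wraps.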

Other causes:

Building giant lists or dicts in a loop without releasing references
NumPy operations that create large intermediate arrays
Constrained environments: VPS with 2 GB RAM, Docker containers, CI runners
32-bit Python hitting its per-process address space limit (2–4 GB depending on the OS, and 2 GB by default on Windows)

Fix 1: Read Files in Chunks

Iterating over a Python file object is lazy: it yields one line at a time. Stop fighting that and use it.

# Bad: pulls the entire file into RAM
with open('huge_file.txt') as f:
    lines = f.readlines()  # MemoryError here

# Good: one line at a time, O(1) memory
with open('huge_file.txt') as f:
    for line in f:
        process(line)

For pandas, the chunksize parameter gives you the same control:

import pandas as pd

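for chunk in pd.read_csv('huge_file.csv', chunksize=100_000):
    # chunk is a DataFrame with up to 100k rows, which is manageable
    result = chunk.groupby('category')['value'].sum()
    # save_partial_result is a placeholder: a category can span chunks,
    # so the partial sums still have to be combined afterwards
    save_partial_result(result)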
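The loop above leaves save_partial_result undefined. One possible way to handle it is to keep a running total and fold each chunk's partial sums into it. This is a minimal sketch, assuming the same 'category' and 'value' columns; `sum_by_category` and the in-memory sample CSV are hypothetical stand-ins:

```python
import io

import pandas as pd


def sum_by_category(csv_source, chunksize=100_000):
    """Sum 'value' per 'category' while holding only one chunk in memory."""
    total = None
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partial = chunk.groupby('category')['value'].sum()
        if total is None:
            total = partial
        else:
            # fill_value=0 handles categories missing from one side
            total = total.add(partial, fill_value=0)
    return total


# Tiny in-memory stand-in for a huge file; chunksize=2 forces three chunks
sample = io.StringIO("category,value\na,1\nb,2\na,3\nb,4\nc,5\n")
totals = sum_by_category(sample, chunksize=2)
print(totals)
```

This works for sums and counts because they combine cleanly across chunks. A mean or median needs more care, for example tracking sum and count separately.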

Fix 2: Use Generators Instead of Lists

Building a list just to loop through it once is wasteful. A generator computes each value on demand and holds almost nothing in memory.

# Bad: 10 million integers all at once in RAM
squares = [x**2 for x in range(10_000_000)]

# Good: generator expression, computes one value at a time
squares = (x**2 for x in range(10_000_000))

for val in squares:
    process(val)
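You can check the difference with sys.getsizeof. This sketch also shows the main catch: a generator can only be consumed once. The sizes reported are for the containers themselves and vary by Python version:

```python
import sys

n = 1_000_000
as_list = [x**2 for x in range(n)]
as_gen = (x**2 for x in range(n))

# getsizeof on the list counts only its pointer array, not the int objects,
# so the real footprint of the list is even larger than this number
list_container = sys.getsizeof(as_list)
gen_size = sys.getsizeof(as_gen)
print(f"list container: {list_container:,} B, generator: {gen_size:,} B")

first_pass = sum(as_gen)   # consumes the generator
second_pass = sum(as_gen)  # already exhausted, so nothing is left to add
```

If you need to loop over the values twice, either recreate the generator or accept the cost of a list.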

For file processing, yield turns any function into a generator:

def read_records(filepath):
    with open(filepath) as f:
        for line in f:
            yield parse(line)

for record in read_records('big.log'):
    process(record)
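Sometimes one record at a time is too fine-grained, for instance when writing to a database in bulk. A generator can hand out fixed-size batches instead. This is a sketch with a hypothetical helper named `in_batches`; `numbers` stands in for something like read_records:

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def numbers():
    # Stand-in for a record generator such as read_records()
    for i in range(10):
        yield i


batch_sums = [sum(batch) for batch in in_batches(numbers(), 4)]
print(batch_sums)
```

Memory use is bounded by the batch size rather than the input size. On Python 3.12 and later, itertools.batched does the same job and yields tuples.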

Fix 3: Reduce Memory Usage with NumPy dtypes

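NumPy defaults to float64 for floating-point data: 8 bytes per element. For 100 million elements, that's 800 MB. Switch to float32 and you cut it to 400 MB, at the cost of precision (about 7 significant digits instead of 15–16). Use uint8 for integers in the 0–255 range and it drops to 100 MB.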

import numpy as np

# Default: float64 = 8 bytes/element → ~800 MB for 100M elements
arr = np.array(data)

# float32 = 4 bytes/element → ~400 MB
arr = np.array(data, dtype=np.float32)

# uint8 = 1 byte/element → ~100 MB
arr = np.array(data, dtype=np.uint8)
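The numbers are easy to confirm with nbytes, and it is worth checking the value range before downcasting, since out-of-range values do not fit in a smaller integer type. This is a sketch at one million elements rather than 100 million; `fits_uint8` is a hypothetical helper:

```python
import numpy as np

n = 1_000_000
# Bytes used by the array data for each dtype
sizes = {
    name: np.zeros(n, dtype=name).nbytes
    for name in ("float64", "float32", "uint8")
}
print(sizes)


def fits_uint8(values):
    """Return True if every value is an integer within the uint8 range."""
    values = np.asarray(values)
    info = np.iinfo(np.uint8)
    return bool(
        np.issubdtype(values.dtype, np.integer)
        and values.min() >= info.min
        and values.max() <= info.max
    )


ok = fits_uint8([0, 17, 255])
too_big = fits_uint8([0, 300])
```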

Same principle applies to pandas — declare dtypes upfront instead of letting pandas guess:

df = pd.read_csv('data.csv', dtype={
    'user_id': 'int32',
    'score': 'float32',
    'category': 'category'  # repeated strings → categorical saves a lot
})
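To measure what explicit dtypes save on your own data, compare memory_usage(deep=True) with and without them. This sketch uses a small generated CSV (the data is made up) with the same three columns as above:

```python
import io

import pandas as pd

colors = ("red", "green", "blue")
rows = 10_000
csv_text = "user_id,score,category\n" + "".join(
    f"{i},{(i % 100) / 10},{colors[i % 3]}\n" for i in range(rows)
)

# Let pandas infer the types
default_df = pd.read_csv(io.StringIO(csv_text))

# Declare compact types upfront
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"user_id": "int32", "score": "float32", "category": "category"},
)

# deep=True includes the memory held by string values
default_bytes = int(default_df.memory_usage(deep=True).sum())
typed_bytes = int(typed_df.memory_usage(deep=True).sum())
print(f"inferred: {default_bytes:,} B, explicit: {typed_bytes:,} B")
```

The category dtype pays off most when a column has few distinct values relative to its length. On a column of mostly unique strings it can cost more than it saves.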

Fix 4: Use Memory-Mapped Files

For large binary files or NumPy arrays, memory mapping hands paging control to the OS. You get array-style access without loading the whole file upfront — the OS fetches only the pages you actually touch.

import numpy as np

# Doesn't read the whole file — maps it
arr = np.load('large_array.npy', mmap_mode='r')

# Slicing returns a view; only the pages behind it are read, and only when accessed
subset = arr[1000:2000]
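Here is a self-contained version you can run to watch this behavior. It is a sketch that writes a small array to a temporary directory first; the file name and sizes are placeholders for your real data:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "large_array.npy")
    np.save(path, np.arange(10_000, dtype=np.float32))

    # Map the file instead of reading it
    arr = np.load(path, mmap_mode="r")
    is_memmap = isinstance(arr, np.memmap)

    view = arr[1000:2000]    # still backed by the file
    subset = np.array(view)  # explicit copy into RAM
    subset_total = float(subset.sum())

    # Drop the mapping before the temporary directory is removed
    del view, arr

print(is_memmap, subset.shape, subset_total)
```

Copying the slice with np.array is what actually pulls the data into memory. Without the copy you keep working against the file-backed view.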

Fix 5: Process with Dask for Out-of-Core DataFrames

When you need pandas-style operations on data that doesn't fit in RAM, Dask handles it natively. The API is nearly identical, but execution is lazy and chunked:

pip install "dask[dataframe]"

import dask.dataframe as dd

# Builds a lazy computation graph — nothing loads yet
df = dd.read_csv('huge_file.csv')

# Also lazy
result = df.groupby('category')['value'].sum()

# .compute() triggers actual execution, chunk by chunk
print(result.compute())

Fix 6: Delete Objects and Force Garbage Collection

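Working with large objects sequentially? Drop your references explicitly as soon as you're done. CPython frees an object the moment its last reference disappears, so del releases a batch before the next one is built instead of leaving two alive at once. gc.collect() only matters for objects caught in reference cycles: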

import gc

for batch in batches:
    result = process(batch)
    save(result)
    del result
    del batch
    gc.collect()  # reclaims reference cycles; worth it only if memory is tight

Treat this as a last resort. If you're regularly hitting MemoryError, restructuring around generators or chunks will serve you better long-term.
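To see what del and gc.collect() each do, you can watch an object through a weak reference. This is a CPython-specific sketch; `Blob` is a hypothetical stand-in for a large object, and automatic collection is switched off so the demo is deterministic:

```python
import gc
import weakref


class Blob:
    """Stand-in for a large object."""

    def __init__(self):
        self.payload = bytearray(1_000_000)


gc.disable()  # turn off automatic cycle collection for the demo

# Case 1: no cycle. Reference counting frees the object on del.
blob = Blob()
probe = weakref.ref(blob)
del blob
freed_by_del = probe() is None

# Case 2: two objects that reference each other. del alone is not enough.
a, b = Blob(), Blob()
a.other, b.other = b, a
probe_cycle = weakref.ref(a)
del a, b
alive_after_del = probe_cycle() is not None
gc.collect()
freed_by_gc = probe_cycle() is None

gc.enable()
print(freed_by_del, alive_after_del, freed_by_gc)
```

If your batches contain no cycles, del does the work and the gc.collect() call is just overhead.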

Fix 7: Add Swap Space (Linux)

On a server where you can't refactor immediately, swap space prevents the crash at the cost of speed. Code that spills into a 4 GB swap file, even on an SSD, might run 10–20x slower than it does in RAM, and sometimes worse, but it beats a crashed process.

# Check what you currently have
free -h

# Create a 4GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Not persistent: add "/swapfile none swap sw 0 0" to /etc/fstab to keep it after a reboot

This buys time. It doesn't fix the root cause.

Verify the Fix

Use memory_profiler to confirm your changes actually reduced peak usage:

pip install memory-profiler

from memory_profiler import profile

@profile
def my_function():
    # your code here
    pass

my_function()

Run the script as usual and the output shows memory usage line by line. After switching to chunking or generators, peak RAM for a streaming workload typically drops from gigabytes to roughly the size of one chunk, often tens of megabytes.
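If installing a package is not an option, the standard library's tracemalloc reports a comparable peak figure for memory allocated through Python's allocator. This sketch swaps in tracemalloc as an alternative to memory_profiler; `peak_bytes`, `with_list`, and `with_generator` are hypothetical names:

```python
import tracemalloc


def peak_bytes(func):
    """Run func and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def with_list():
    # Materializes every value before summing
    return sum([x * x for x in range(200_000)])


def with_generator():
    # Produces one value at a time
    return sum(x * x for x in range(200_000))


list_peak = peak_bytes(with_list)
gen_peak = peak_bytes(with_generator)
print(f"list: {list_peak:,} B peak, generator: {gen_peak:,} B peak")
```

Tracing slows the code down noticeably, so use it on a sample rather than the full run.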

Watch live usage in a second terminal while your script runs:

watch -n 1 'free -h'

Prevention

Profile before scaling: Run memory_profiler on a small sample first. Catching a 10x memory spike at 1k rows is much cheaper than debugging it at 10M rows.
Default to generators: Any function that produces a sequence should use yield unless you have a concrete reason to materialize the whole list.
Set explicit dtypes in pandas: Left to infer types, pandas stores numeric columns as int64/float64 and text columns as generic objects. On a 50-column dataset, that's often 2–4x more memory than necessary.
Benchmark chunk sizes: Too small adds I/O overhead; too large spikes memory. For most workloads, 50k–200k rows per chunk is a reasonable starting range — tune from there.