Fixing AWS Glue OOM: Resolving 'java.lang.OutOfMemoryError: Java heap space'

The Error Message

Your AWS Glue ETL job was humming along until it suddenly quit. You check the CloudWatch logs and find this specific failure:

Command failed with exit code 1. java.lang.OutOfMemoryError: Java heap space

This crash happens when the Spark driver or an executor runs out of JVM heap. Essentially, the job is trying to hold more data in memory at one time than the heap on one of your Glue workers can accommodate.

Why Glue Jobs Run Out of Heap Space

Spark memory issues rarely happen by accident. They usually stem from how the data is structured or how the code handles it. Here are the most common culprits:

Data Skew: One partition is 10x larger than the others. This forces a single worker to do all the heavy lifting while others sit idle, eventually causing that overworked node to crash.
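Large Broadcast Joins: A broadcast join sends a copy of the "small" table to every worker. Spark only does this automatically for tables it estimates to be under spark.sql.autoBroadcastJoinThreshold (10MB by default), but if that estimate is wrong, the threshold has been raised, or a broadcast hint forces it, a table that is actually 500MB or 1GB gets materialized on the driver and on every executor and eats up the heap.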
The Small File Problem: If you are reading 50,000 files that are only 10KB each, the Spark driver will choke trying to track the metadata for every single file.
Memory-Heavy Operations: Using .collect() or .toPandas() pulls the entire dataset into the driver node's memory. If your dataset is 5GB and your driver only has 4GB of heap space, it will fail instantly.
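A quick way to test for the first culprit is to compare partition sizes. The sketch below is plain Python so it runs anywhere; the helper name skew_ratio and the sample row counts are made up for illustration, and in a real job the counts would have to come from Spark itself.

```python
from statistics import median


def skew_ratio(partition_counts):
    """Return the largest partition size divided by the median partition size."""
    non_empty = [count for count in partition_counts if count > 0]
    if not non_empty:
        return 0.0
    return max(non_empty) / median(non_empty)


# Hypothetical row counts per partition. In a Glue job they could come from
# df.groupBy(spark_partition_id()).count(), using pyspark.sql.functions.
counts = [1000, 1100, 950, 1050, 10500]
ratio = skew_ratio(counts)
print(f"largest partition is {ratio:.1f}x the median")
```

A ratio close to 1 suggests the load is balanced; a ratio near the 10x mentioned above points at skew.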

Step 1: Scale Up the Worker Type

Sometimes you just need a bigger boat. If your G.1X workers (16GB RAM) are failing, upgrading to G.2X (32GB RAM) is the fastest way to stabilize the job. This gives Spark more room to breathe during shuffles and joins.

To change this in the AWS Console:

Navigate to your Glue Job configuration.
Find Worker type under the Job details tab.
Switch from G.1X to G.2X.
Consider enabling Flex execution for non-urgent jobs to save up to 34% on costs while using these larger workers.
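If you would rather try the larger worker type on a single run before editing the job, the same settings can be passed as per-run overrides. The sketch below only builds the keyword arguments for boto3's Glue start_job_run call (WorkerType, NumberOfWorkers, and ExecutionClass are parameters of that API). The job name and worker count are placeholders, and the actual call is left commented out because it needs AWS credentials.

```python
def build_run_overrides(job_name, worker_type="G.2X", num_workers=10, flex=False):
    """Build keyword arguments for boto3's glue.start_job_run (per-run overrides)."""
    if num_workers < 2:
        raise ValueError("num_workers must be at least 2")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": num_workers,
        "ExecutionClass": "FLEX" if flex else "STANDARD",
    }


# "my-etl-job" is a hypothetical job name used only for illustration.
overrides = build_run_overrides("my-etl-job", flex=True)
print(overrides)

# With credentials configured, the run could be started like this:
# import boto3
# boto3.client("glue").start_job_run(**overrides)
```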

Step 2: Fix Data Skew and Small Files

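If scaling up doesn't work, your data is likely unevenly distributed. You can force Spark to spread the load by repartitioning on a high-cardinality column whose values are themselves evenly distributed. If a single key value dominates, all of its rows still land in one partition, so repartitioning on that key alone will not remove the skew.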

# Hash-partition the data by user_id into 200 partitions
df = df.repartition(200, "user_id")
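When one key value dominates, a common workaround (not part of the snippet above) is salting: add a suffix to the key so its rows spread over several partitions. The plain-Python simulation below illustrates the effect. The key names, the salt count of 8, and the CRC32-based partitioner are illustrative stand-ins for Spark's own hash partitioning, not what Glue does internally.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8


def partition_of(key):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def max_partition_load(keys):
    """Number of rows that end up in the busiest partition."""
    return max(Counter(partition_of(key) for key in keys).values())


# 9,000 rows for one hot user plus one row each for 1,000 other users.
rows = ["user_hot"] * 9000 + [f"user_{i}" for i in range(1000)]

# Without salting, every hot row hashes to the same partition.
plain_load = max_partition_load(rows)

# With salting, each row gets a suffix 0..7 (round-robin here), so the hot
# key turns into 8 distinct keys that can land in different partitions.
salted = [f"{key}#{i % 8}" for i, key in enumerate(rows)]
salted_load = max_partition_load(salted)

print(f"busiest partition: {plain_load} rows unsalted, {salted_load} rows salted")
```

In PySpark the same idea is usually expressed by adding a salt column and repartitioning on both the key and the salt; joins and aggregations on the original key then need to account for the salt.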

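To fix the small file problem in S3, use Glue’s groupFiles feature. Instead of scheduling one task per file, it has each Spark task read a group of small files (e.g., up to 128MB per group), which cuts the number of tasks the driver has to track. The files in S3 are not rewritten.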

datasource = glueContext.create_dynamic_frame.from_catalog(
    database = "sales_db", 
    table_name = "transactions", 
    additional_options = {"groupFiles": "inPartition", "groupSize": "134217728"} # 128MB groups
)
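To see why grouping helps, it is worth doing the arithmetic on the earlier example: 50,000 files of 10KB each add up to less than 500MB. The sketch below works that out and builds the additional_options dictionary. The helper name group_options is hypothetical, and the group count is a rough estimate that ignores partition boundaries.

```python
import math


def group_options(group_size_mb=128):
    """Build Glue additional_options for file grouping (groupSize is bytes, as a string)."""
    return {
        "groupFiles": "inPartition",
        "groupSize": str(group_size_mb * 1024 * 1024),
    }


file_count = 50_000
file_size_bytes = 10 * 1024  # 10KB per file
total_bytes = file_count * file_size_bytes

options = group_options(128)
groups = math.ceil(total_bytes / int(options["groupSize"]))

print(options["groupSize"])
print(f"{total_bytes / 1024**2:.0f}MB total: about {groups} groups instead of {file_count} files")
```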

Step 3: Fine-Tune Spark Memory

The default Spark settings aren't always optimal for every ETL pattern. You can override these by adding parameters to your Glue job. These settings help prevent the driver from trying to do too much at once.

--conf spark.sql.autoBroadcastJoinThreshold: Set to -1 to disable automatic broadcasting if you have large lookup tables.
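--conf spark.driver.maxResultSize: Caps the total size of serialized results that actions such as .collect() can return to the driver. Increase this to 4g or higher if the job is hitting that limit during final data collection, but keep in mind that a higher cap lets more data into the driver heap.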

Add these in the Job parameters section of your Glue configuration:

Key: --conf
Value: spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.driver.maxResultSize=4g
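Job parameters are a key/value map, so the --conf key can appear only once; that is why the second setting is chained inside the value. If you manage several settings, a small helper can build that string for you. This is a sketch, and the function name chain_spark_conf is made up for illustration.

```python
def chain_spark_conf(settings):
    """Join Spark settings into the single value used for the --conf job parameter."""
    if not settings:
        raise ValueError("at least one Spark setting is required")
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


value = chain_spark_conf({
    "spark.sql.autoBroadcastJoinThreshold": -1,
    "spark.driver.maxResultSize": "4g",
})
print(value)
# spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.driver.maxResultSize=4g
```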

Step 4: Leverage Glue Dynamic Frames

Native Spark DataFrames need a schema before the data is processed, and inferring one from nested JSON or records with varying fields can require an extra pass over the data. Glue Dynamic Frames are self-describing: each record carries its own schema, so no up-front schema is required and inconsistent fields are handled more gracefully. Use them for the initial heavy lifting before converting to a Spark DataFrame for complex math.

# Use Glue's native Filter transform instead of Spark SQL
from awsglue.transforms import Filter

dynamic_frame = glueContext.create_dynamic_frame.from_options(...)
# Keep only the records whose status field is "active"
filtered_frame = Filter.apply(frame=dynamic_frame, f=lambda x: x["status"] == "active")

Verification

How do you know it's actually fixed? Check these three areas:

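CloudWatch Metrics: With job metrics enabled, monitor glue.driver.jvm.heap.usage (and glue.ALL.jvm.heap.usage for the executors). The metric is reported as a fraction between 0 and 1. If it stays below 0.7-0.8 (70-80%), your job is healthy. If it spikes to 0.95 (95%) and stays there, you are still at risk.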
Spark UI: Open the Spark History Server. Look for a single task that takes much longer than the others in its stage; that is a strong sign you still have a data skew problem.
Log Status: The job should finish with a Succeeded status and no "Exit Code 1" in the logs.
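The heap thresholds above can be turned into a simple check once you have exported the metric samples. The sketch below is plain Python: the 0.8 and 0.95 cutoffs come from the text, while the function name and the rule of three consecutive samples for "stays there" are assumptions you should tune.

```python
def classify_heap(samples, healthy_below=0.8, critical_at=0.95, sustained=3):
    """Classify JVM heap usage samples given as fractions between 0 and 1.

    Returns "at risk" if `sustained` consecutive samples reach `critical_at`,
    "healthy" if every sample stays below `healthy_below`, otherwise "watch".
    """
    if not samples:
        raise ValueError("no samples to classify")
    run = 0
    for value in samples:
        run = run + 1 if value >= critical_at else 0
        if run >= sustained:
            return "at risk"
    if all(value < healthy_below for value in samples):
        return "healthy"
    return "watch"


print(classify_heap([0.42, 0.55, 0.61, 0.58]))        # healthy
print(classify_heap([0.60, 0.96, 0.97, 0.98, 0.97]))  # at risk
print(classify_heap([0.60, 0.96, 0.70, 0.85]))        # watch
```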

Pro-tips for Long-term Stability

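Enable Auto-scaling: On Glue 3.0 or later, let Glue manage the worker count. It adds workers when the job has more parallel work than the current executors can handle and removes them when they sit idle. This scales the number of workers, not the memory of each one, so it complements the fixes above rather than replacing them.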

--enable-auto-scaling: true

Stop using .collect(): This is one of the most common causes of driver OOMs. If you need to see data, use df.show(10) instead of pulling the whole set into the driver.
Pushdown Predicates: Filter on partition columns when you read from the Data Catalog so Glue skips entire S3 partitions and never lists or loads files you don't need.
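A pushdown predicate is just a string in Spark SQL WHERE-clause syntax that references partition columns. The sketch below builds one; the helper name and the year and month partition columns are assumptions for illustration, so substitute the partition keys your table actually has.

```python
def partition_predicate(**partitions):
    """Build a push_down_predicate string from partition column/value pairs."""
    if not partitions:
        raise ValueError("at least one partition column is required")
    clauses = [f"{column} == '{value}'" for column, value in partitions.items()]
    return " and ".join(clauses)


# Hypothetical partition columns; values are not escaped, so avoid quotes in them.
predicate = partition_predicate(year="2024", month="06")
print(predicate)  # year == '2024' and month == '06'

# In a Glue script the string is passed when reading from the catalog:
# glueContext.create_dynamic_frame.from_catalog(
#     database="sales_db",
#     table_name="transactions",
#     push_down_predicate=predicate,
# )
```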
