The Error Message
Your AWS Glue ETL job was humming along until it suddenly quit. You check the CloudWatch logs and find this specific failure:
Command failed with exit code 1. java.lang.OutOfMemoryError: Java heap space
This crash happens when the Spark driver or an executor runs out of JVM heap memory. In practice, some part of the job is trying to hold more data in memory at once than the heap on a Glue worker allows.
Why Glue Jobs Run Out of Heap Space
Spark memory issues rarely happen by accident. They usually stem from how the data is structured or how the code handles it. Here are the most common culprits:
- Data Skew: One partition is 10x larger than the others. This forces a single worker to do all the heavy lifting while others sit idle, eventually causing that overworked node to crash.
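- Large Broadcast Joins: In a broadcast join, Spark sends a copy of the "small" table to every worker. Spark only does this automatically for tables it estimates to be under the default 10MB threshold, but if that estimate is wrong or the join is forced with a broadcast hint, a table that is actually 500MB or 1GB gets copied to every worker and eats up the heap.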
- The Small File Problem: If you are reading 50,000 files that are only 10KB each, the Spark driver will choke trying to track the metadata for every single file.
- Memory-Heavy Operations: Using .collect() or .toPandas() pulls the entire dataset into the driver node's memory. If your dataset is 5GB and your driver only has 4GB of heap space, it will fail instantly.
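To put a number on the data skew item above, one rough approach is to compare the largest partition against a typical one. The sketch below is plain Python over a list of per-partition row counts; the function name and the sample counts are hypothetical. In a real job the counts might come from something like df.groupBy(spark_partition_id()).count(), which is an assumption about your setup rather than part of this example.

```python
from statistics import median


def skew_ratio(partition_row_counts):
    """Return the largest partition size divided by the median partition size."""
    if not partition_row_counts:
        raise ValueError("need at least one partition count")
    typical = median(partition_row_counts)
    largest = max(partition_row_counts)
    if typical == 0:
        # Every "typical" partition is empty: any non-empty one is pure skew.
        return float("inf") if largest > 0 else 1.0
    return largest / typical


# Hypothetical counts: nine similar partitions and one that is 10x larger.
counts = [1_000] * 9 + [10_000]
print(f"skew ratio: {skew_ratio(counts):.1f}")
```

A ratio near 1 means the work is spread evenly. A ratio around 10, like the example in the list above, points at a single overloaded partition.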
Step 1: Scale Up the Worker Type
Sometimes you just need a bigger boat. If your G.1X workers (16GB RAM) are failing, upgrading to G.2X (32GB RAM) is the fastest way to stabilize the job. This gives Spark more room to breathe during shuffles and joins.
To change this in the AWS Console:
- Navigate to your Glue Job configuration.
- Find Worker type under the Job details tab.
- Switch from G.1X to G.2X.
- Consider enabling Flex execution for non-urgent jobs to save up to 34% on costs while using these larger workers.
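If you would rather script the change than click through the console, a sketch along these lines could work with boto3. It assumes you fetch the existing definition with get_job and pass a modified copy to update_job, because update_job replaces the whole definition. The helper name, the job name in the usage comment, and the exact set of keys to strip are assumptions to check against the boto3 Glue documentation.

```python
# Keys returned by get_job that are assumed not to be accepted in JobUpdate.
READ_ONLY_KEYS = {"Name", "CreatedOn", "LastModifiedOn"}
# Capacity keys assumed to conflict with an explicit WorkerType.
CAPACITY_KEYS = {"AllocatedCapacity", "MaxCapacity"}


def build_job_update(job, worker_type="G.2X", flex=False):
    """Copy an existing job definition, swapping in a new worker type."""
    skipped = READ_ONLY_KEYS | CAPACITY_KEYS
    update = {key: value for key, value in job.items() if key not in skipped}
    update["WorkerType"] = worker_type
    update["ExecutionClass"] = "FLEX" if flex else "STANDARD"
    return update


# Usage sketch (needs boto3 and AWS credentials; "my-etl-job" is hypothetical):
# import boto3
# glue = boto3.client("glue")
# job = glue.get_job(JobName="my-etl-job")["Job"]
# glue.update_job(JobName="my-etl-job", JobUpdate=build_job_update(job, flex=True))
```

Copying the existing definition first keeps the role, script location, and job arguments intact while only the worker settings change.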
Step 2: Fix Data Skew and Small Files
If scaling up doesn't work, your data is likely unevenly distributed. You can force Spark to spread the load by repartitioning the data based on a high-cardinality column.
# Hash-partition rows into 200 partitions keyed on user_id
df = df.repartition(200, "user_id")
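Why does the column need to be high-cardinality? The toy simulation below hashes keys into 200 buckets, the way a hash partitioner would, and compares a three-value column with a column of unique IDs. It uses MD5 as a stand-in hash (Spark uses its own hash function), and the column values are made up for illustration.

```python
import hashlib
from collections import Counter


def assign_partition(key, num_partitions):
    """Map a key to a partition index with a stand-in hash (not Spark's)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each non-empty partition."""
    return Counter(assign_partition(key, num_partitions) for key in keys)


# Hypothetical data: 30,000 rows keyed two different ways.
statuses = ["active", "inactive", "pending"] * 10_000  # 3 distinct values
user_ids = [f"user_{i}" for i in range(30_000)]        # 30,000 distinct values

low = partition_sizes(statuses, 200)
high = partition_sizes(user_ids, 200)
print("status key :", len(low), "non-empty partitions, largest =", max(low.values()))
print("user_id key:", len(high), "non-empty partitions, largest =", max(high.values()))
```

With only three distinct values, at most three partitions ever receive rows, so asking for 200 partitions changes nothing. Unique IDs spread across all of them.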
To fix the small file problem in S3, use Glue’s groupFiles feature. This merges tiny files into larger, more manageable chunks (e.g., 128MB) before Spark starts processing them.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="transactions",
    # Group small files within each S3 partition into ~128MB chunks
    additional_options={"groupFiles": "inPartition", "groupSize": "134217728"},
)
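To see what grouping buys you, here is a back-of-the-envelope sketch using the 50,000 files of 10KB each from earlier. The helper names are hypothetical, and the estimate ignores that inPartition grouping happens per S3 partition, so a real job would likely end up with more groups than this.

```python
import math

MB = 1024 ** 2


def grouped_read_options(group_size_bytes=128 * MB):
    """Build the additional_options dict; groupSize is passed as a string of bytes."""
    return {"groupFiles": "inPartition", "groupSize": str(group_size_bytes)}


def estimate_groups(file_count, avg_file_bytes, group_size_bytes=128 * MB):
    """Rough lower bound on the number of groups Spark would have to track."""
    total_bytes = file_count * avg_file_bytes
    return max(1, math.ceil(total_bytes / group_size_bytes))


print(grouped_read_options())
print("groups:", estimate_groups(50_000, 10 * 1024))
```

In this estimate, 50,000 tiny files collapse into a handful of read groups, which is far less bookkeeping for the driver.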
Step 3: Fine-Tune Spark Memory
The default Spark settings aren't always optimal for every ETL pattern. You can override these by adding parameters to your Glue job. These settings help prevent the driver from trying to do too much at once.
- --conf spark.sql.autoBroadcastJoinThreshold: Set to -1 to disable automatic broadcasting if you have large lookup tables.
- --conf spark.driver.maxResultSize: Increase this to 4g or higher if your driver is crashing during final data collection.
Add these in the Job parameters section of your Glue configuration:
Key: --conf
Value: spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.driver.maxResultSize=4g
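Because the job parameter takes a single --conf key, multiple settings are chained inside the value, as in the example above. If you generate job definitions from code, a small helper like this hypothetical one can build that string from a dictionary of settings.

```python
def build_conf_value(settings):
    """Join Spark settings into the chained value for a single --conf key."""
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


spark_settings = {
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    "spark.driver.maxResultSize": "4g",
}
# Shape of the argument as it would appear in the job's parameters.
job_arguments = {"--conf": build_conf_value(spark_settings)}
print(job_arguments["--conf"])
```

The printed string matches the Value field shown above, so it can be pasted into the console or passed along with the job's other default arguments.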
Step 4: Leverage Glue Dynamic Frames
Native Spark DataFrames need a schema before they can load data, and inferring one for nested JSON or files with varying fields can take an extra pass over the data and a lot of memory. Glue Dynamic Frames are self-describing: each record carries its own schema, so they handle schema changes more gracefully. Use them for the initial heavy lifting before converting to a Spark DataFrame for complex math.
from awsglue.transforms import Filter

# Use Glue's native Filter transform instead of Spark SQL
dynamic_frame = glueContext.create_dynamic_frame.from_options(...)
filtered_frame = Filter.apply(frame=dynamic_frame, f=lambda x: x["status"] == "active")
Verification
How do you know it's actually fixed? Check these three areas:
- CloudWatch Metrics: Monitor glue.driver.jvm.heap.usage. If it stays below 70-80%, your job is healthy. If it spikes to 95% and stays there, you are still at risk.
- Spark UI: Open the Spark History Server. Look for a single task that takes much longer than others; this confirms you still have a data skew problem.
- Log Status: The job should finish with a Succeeded status and no "Exit Code 1" in the logs.
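The heap thresholds above can be turned into a simple check once you have exported the metric's datapoints. This sketch assumes the samples are fractions between 0 and 1 (divide by 100 if your export shows percentages). The function name, the "watch" middle state, and the three-sample definition of "stays there" are illustrative choices, not Glue features.

```python
def classify_heap_usage(samples, healthy_below=0.80, critical_at=0.95, sustained_points=3):
    """Label a run from driver heap usage samples (fractions between 0 and 1)."""
    if not samples:
        raise ValueError("need at least one sample")
    run = longest_run = 0
    for value in samples:
        # Track the longest streak of consecutive samples at or above critical.
        run = run + 1 if value >= critical_at else 0
        longest_run = max(longest_run, run)
    if longest_run >= sustained_points:
        return "at risk"
    if max(samples) < healthy_below:
        return "healthy"
    return "watch"


print(classify_heap_usage([0.40, 0.55, 0.70]))
print(classify_heap_usage([0.60, 0.96, 0.97, 0.98]))
```

A brief spike that drops back down lands in "watch" rather than "at risk", which mirrors the distinction between spiking and staying at 95%.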
Pro-tips for Long-term Stability
- Enable Auto-scaling: Let Glue manage the worker count. It adds workers when a stage has more parallel work than the current executors can handle and removes them when they sit idle.
--enable-auto-scaling: true
- Stop using .collect(): This is one of the most common causes of driver OOMs. If you need to see data, use df.show(10) instead of pulling the whole set into the driver.
- Pushdown Predicates: Filter on partition columns when you read from S3 so partitions you don't need are never loaded into memory.
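For the pushdown predicate tip, the filter is passed as a string through the push_down_predicate argument of create_dynamic_frame.from_catalog. The helper below is a hypothetical convenience for building that string. The year and month partition columns are assumptions about how the table is partitioned.

```python
def build_push_down_predicate(partition_values):
    """Build a predicate string such as (year=='2024' and month=='01')."""
    clauses = []
    for column, value in partition_values.items():
        text = str(value)
        if "'" in text:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"unsupported quote in value for {column}")
        clauses.append(f"{column}=='{text}'")
    return "(" + " and ".join(clauses) + ")"


predicate = build_push_down_predicate({"year": "2024", "month": "01"})
print(predicate)

# Usage sketch inside a Glue job (glueContext comes from the job's boilerplate):
# datasource = glueContext.create_dynamic_frame.from_catalog(
#     database="sales_db",
#     table_name="transactions",
#     push_down_predicate=predicate,
# )
```

The predicate only prunes on partition columns. Filters on regular columns still need a Filter transform or a DataFrame filter after the read.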

