Fixing Redis 'CLUSTERDOWN Hash slot not served' when a slot lacks a serving node in Redis Cluster

Context: The Midnight Call – 'CLUSTERDOWN Hash slot not served'

It's 2 AM, and your pager just screamed. A critical application is down, and the logs are spewing CLUSTERDOWN Hash slot not served errors from Redis. This isn't just a warning; it means at least one hash slot has no node serving it. With the default setting cluster-require-full-coverage yes, the cluster then refuses all queries, including those for keys in slots that are still healthy, because it can no longer guarantee that the whole key space is available.

This error typically manifests when one or more of the 16384 hash slots that make up a Redis Cluster are not assigned to any primary node. Common scenarios leading to this include:

A primary node failing permanently without a healthy replica to take over, or an automatic failover that did not complete successfully.
A primary node being improperly removed from the cluster, leaving its slots orphaned.
An incomplete or failed resharding operation.
A cluster starting up with insufficient nodes to cover all slots.

Your immediate goal is to identify which slots are unassigned and then re-assign them to healthy primary nodes to bring the cluster back to an OK state.

Debug Process: Pinpointing the Orphaned Slots

First, we need to confirm the cluster's state and identify the problematic slots. Connect to any node in your cluster using redis-cli.

1. Check the Cluster State

The most basic check will immediately tell you if the cluster is healthy.

redis-cli -h <any_cluster_ip> -p <any_cluster_port> cluster info

Look for the cluster_state and cluster_slots_assigned fields. You'll likely see something like this:

cluster_state:fail
cluster_slots_assigned:16380
cluster_slots_ok:16380
cluster_slots_pfail:0
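cluster_slots_fail:0
...

The key here is cluster_state:fail together with cluster_slots_assigned being less than 16384 (the total number of slots); the difference, 4 in this example, is the number of unassigned slots. cluster_slots_fail counts something different: slots that are still assigned, but to a node flagged FAIL.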
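The arithmetic above can be scripted for an alert or a runbook. This is a minimal sketch, assuming you have already captured the text output of cluster info; the helper names parse_cluster_info and unassigned_slot_count are hypothetical and not part of any Redis client.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in a Redis Cluster


def parse_cluster_info(text: str) -> dict:
    """Turn 'key:value' lines from CLUSTER INFO into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:  # skip blank or malformed lines
            fields[key] = value
    return fields


def unassigned_slot_count(text: str) -> int:
    """Slots with no owner: total slots minus cluster_slots_assigned."""
    fields = parse_cluster_info(text)
    return TOTAL_SLOTS - int(fields["cluster_slots_assigned"])
```

Feed it the stdout of redis-cli -h <any_cluster_ip> -p <any_cluster_port> cluster info; any result above zero means some slots have no owner.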

2. Identify Unassigned Slots and Failed Nodes

Next, let's see which slots are actually causing the problem and if any nodes are marked as failed.

redis-cli -h <any_cluster_ip> -p <any_cluster_port> cluster slots

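This command lists every assigned slot range together with the primary and replicas serving it. Unassigned slots simply do not appear, so look for gaps between consecutive ranges (for example, one range ending at 1023 and the next starting at 1028). The output carries no health flags; node status comes from cluster nodes. On Redis 7.0 and later, CLUSTER SLOTS is deprecated in favor of CLUSTER SHARDS.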

Also, check the status of individual nodes:

redis-cli -h <any_cluster_ip> -p <any_cluster_port> cluster nodes

Look for lines whose flags include fail (a confirmed failure) or fail? (PFAIL, a suspected failure not yet confirmed by a majority of primaries). Note their IDs and IP:Port. If a primary node is marked as fail and its slots are not covered by an active replica, those are your orphaned slots.
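Reading slot ranges off cluster nodes by eye is error-prone at 2 AM, so the check can be scripted. The sketch below assumes the standard cluster nodes text layout (ID, address, flags, primary ID, ping, pong, epoch, link state, then slot ranges); the function names are hypothetical. It treats a slot as served only when a primary without the fail flag owns it.

```python
TOTAL_SLOTS = 16384


def served_slots(nodes_text: str) -> set:
    """Collect slots owned by primaries that are not flagged 'fail'."""
    served = set()
    for line in nodes_text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a node line
        flags = parts[2].split(",")
        # 'fail?' (PFAIL) is a different flag and does not match 'fail' here.
        if "master" not in flags or "fail" in flags:
            continue
        for token in parts[8:]:
            if token.startswith("["):
                continue  # migrating/importing marker, not an owned range
            first, _, last = token.partition("-")
            served.update(range(int(first), int(last or first) + 1))
    return served


def missing_ranges(served: set) -> list:
    """Return (first, last) tuples for every run of slots not in 'served'."""
    ranges, start = [], None
    for slot in range(TOTAL_SLOTS + 1):  # one extra step closes a trailing run
        missing = slot < TOTAL_SLOTS and slot not in served
        if missing and start is None:
            start = slot
        elif not missing and start is not None:
            ranges.append((start, slot - 1))
            start = None
    return ranges
```

Calling missing_ranges(served_slots(output)) on the captured output yields the ranges you need to restore.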

3. Validate with `redis-cli --cluster check`

The redis-cli --cluster check utility is invaluable for a deeper dive:

redis-cli --cluster check <any_cluster_ip>:<any_cluster_port>

This command checks slot coverage, open (migrating or importing) slots, and whether the nodes agree about the slot configuration. When slots are uncovered, it reports an error along these lines:

# Illustrative output; exact wording varies by Redis version
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.

Solution: Reclaiming the Unserved Slots

Now that we know the problem, let's fix it. The approach depends on whether the original node that served the slots is recoverable or permanently lost.

Scenario A: The Node is Temporarily Down or Recoverable

If the node that previously owned the unserved slots is merely down (e.g., due to a reboot, network glitch), the simplest solution is to bring it back online. Once the node rejoins the cluster, it should reclaim its slots, and the cluster state should return to OK.

# Example: Restarting a Redis service
sudo systemctl start redis-server@<port> # On systemd systems; the unit name varies by distribution
# Or manually start the redis-server process if not using a service manager

Monitor cluster info and cluster nodes after restart. If the node comes back and reclaims its slots, you're good.

Scenario B: The Node is Permanently Lost or Needs Manual Intervention

If the primary node is gone for good, or if bringing it back online didn't resolve the issue (e.g., its data was corrupted), you need to reassign its slots to other healthy primary nodes. This is where redis-cli --cluster comes in handy.

Option 1: Using `redis-cli --cluster fix` (Usual First Step)

fix is the redis-cli subcommand built for this situation. It runs the same checks as check, closes open slots, and, after asking for confirmation, assigns uncovered slots to the primaries it can reach.

redis-cli --cluster fix <any_cluster_ip>:<any_cluster_port>

Keys that lived only on the lost primary are not recovered; the reassigned slots start out empty. If the dead primary is still listed in the cluster, fix may refuse to touch its slots. Recent redis-cli versions accept --cluster-fix-with-unreachable-masters for that case; confirm with redis-cli --cluster help on your version before relying on it.
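If you would rather choose the receiving primary yourself, unassigned slots can also be claimed by hand with CLUSTER ADDSLOTS, run against a healthy primary. The sketch below only builds the redis-cli argument list from a list of slot ranges; addslots_argv is a hypothetical helper, and the host, port, and ranges in the usage comment are placeholders.

```python
def addslots_argv(host: str, port: int, ranges: list) -> list:
    """Build a redis-cli command that assigns the given slot ranges to one node.

    ranges is a list of inclusive (first, last) tuples, e.g. [(1024, 1027)].
    CLUSTER ADDSLOTS takes individual slot numbers, so ranges are expanded.
    """
    slots = [str(slot) for first, last in ranges for slot in range(first, last + 1)]
    if not slots:
        raise ValueError("no slots to assign")
    return ["redis-cli", "-h", host, "-p", str(port), "cluster", "addslots", *slots]


# Usage (assumes redis-cli is on PATH and the target is a healthy primary):
#   import subprocess
#   subprocess.run(addslots_argv("10.0.0.1", 6379, [(1024, 1027)]), check=True)
```

CLUSTER ADDSLOTS fails if the receiving node believes a slot is already assigned, so this fits slots that are truly ownerless. As with fix, the slots come back empty.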

Option 2: Resharding Slots to Existing Nodes (After Coverage Is Restored)

reshard moves slots, along with their keys, from one live primary to another. It cannot pick up slots that have no owner: it runs the same consistency check as --cluster check first and stops with a "Please fix your cluster problems before resharding" message while slots are uncovered. Use it once coverage is restored, for example when the primary that absorbed the orphaned slots now carries more than its share.

redis-cli --cluster reshard <any_cluster_ip>:<any_cluster_port>

The command will prompt you for several pieces of information:

How many slots do you want to move (from 1 to 16384)? Enter the number of slots you want to shift off the overloaded primary.
What is the receiving node ID? This is the ID of an existing healthy primary node that will take on these slots. You can get node IDs from redis-cli cluster nodes. Choose a node that isn't already overloaded.
Source node IDs? Enter all to draw slots evenly from every other primary, or enter the ID of each source primary followed by done. Sources must be reachable primaries that currently own slots; a failed node or an unassigned slot cannot act as a source.

Confirm the plan, and the resharding process will begin. Keys stored in the moved slots migrate to the receiving primary along with the slots.

Option 3: Adding a New Primary Node and Resharding

If your cluster is already short on capacity or if you want to maintain an even distribution of slots and data, consider adding a new primary node and moving slots onto it.

Step 1: Add the new node as a primary

redis-cli --cluster add-node <new_node_ip>:<new_node_port> <any_existing_cluster_ip>:<any_existing_cluster_port>

add-node joins the new node as an empty primary by default, so no extra flag is needed (--cluster-slave would add it as a replica instead). The new primary owns no slots initially.

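Step 2: Reshard slots to the new primary

reshard refuses to run while the cluster check reports errors, so if slots are still uncovered, restore coverage first (for example with redis-cli --cluster fix). Then move slots onto the new node:

redis-cli --cluster reshard <any_existing_cluster_ip>:<any_existing_cluster_port>

When prompted:

How many slots do you want to move? Enter the number of slots the new primary should own; for an even split, roughly 16384 divided by the number of primaries.
What is the receiving node ID? Enter the ID of the new primary node you just added.
Source node IDs? Enter all to draw slots evenly from the existing primaries.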

Verification Steps: Confirming the Fix

Once you've attempted a fix, it's crucial to verify the cluster is fully operational.

1. Check Cluster State Again

redis-cli -h <any_cluster_ip> -p <any_cluster_port> cluster info

You should now see:

cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
...
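The cluster can take a few seconds to converge after a fix, so a single cluster info call may still show fail. Below is a rough polling sketch; wait_for_cluster_ok is a hypothetical helper that takes any callable returning the cluster info text, which keeps it independent of how you reach Redis.

```python
import time


def wait_for_cluster_ok(fetch_info, timeout=60.0, interval=2.0,
                        sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll until CLUSTER INFO reports ok with all 16384 slots assigned.

    fetch_info: zero-argument callable returning CLUSTER INFO output as text.
    Returns True once healthy, False if the timeout elapses first.
    """
    deadline = clock() + timeout
    while True:
        lines = {line.strip() for line in fetch_info().splitlines()}
        if "cluster_state:ok" in lines and "cluster_slots_assigned:16384" in lines:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (assumes redis-cli is on PATH; host and port are placeholders):
#   import subprocess
#   fetch = lambda: subprocess.run(
#       ["redis-cli", "-h", "10.0.0.1", "-p", "6379", "cluster", "info"],
#       capture_output=True, text=True, check=True).stdout
#   wait_for_cluster_ok(fetch)
```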

2. Verify Slot Coverage

redis-cli --cluster check <any_cluster_ip>:<any_cluster_port>

This should report no unassigned slots and a healthy cluster structure.

3. Test Data Operations

Perform some basic read and write operations to ensure data can be stored and retrieved correctly across different keys, which will hit various slots. Pass -c so redis-cli runs in cluster mode and follows MOVED redirections; without it, any key whose slot lives on another node returns a MOVED error instead of the value.

redis-cli -c -h <any_cluster_ip> -p <any_cluster_port> SET mykey "hello from fixed cluster"
redis-cli -c -h <any_cluster_ip> -p <any_cluster_port> GET mykey
redis-cli -c -h <any_cluster_ip> -p <any_cluster_port> SET anotherkey "another value"
redis-cli -c -h <any_cluster_ip> -p <any_cluster_port> GET anotherkey

If all these checks pass, congratulations! Your Redis Cluster is back in business.

Lessons Learned & Prevention

Robust Monitoring: Implement comprehensive monitoring for Redis Cluster, not just individual node health. Track cluster_state, cluster_slots_assigned, and node flags (fail, PFAIL). Alerting on a non-ok cluster state is critical.
Proper Node Removal: Never just kill a primary node without properly migrating its slots or ensuring its replicas can failover. Use redis-cli --cluster del-node after ensuring slots are moved/replicated.
Replication Factor: Ensure a sufficient replication factor (at least 1, meaning each primary has at least one replica) so that automatic failover can occur if a primary node goes down.
Automated Backups: Regular RDB snapshots or AOF persistence are your last line of defense against data loss in catastrophic scenarios.
Understanding Slot Distribution: When manually dealing with slot ranges or trying to understand key distribution in a cluster, it helps to verify which slot a specific key belongs to. Redis computes the slot as CRC16(key) mod 16384, hashing only the hash tag (the text between the first { and the next }) when the key contains a non-empty one. CLUSTER KEYSLOT <key> returns the slot directly from any node. For quick offline checks I often use browser-based tools like ToolCraft's Hash Generator, which is handy because it doesn't send data anywhere; it only helps here if the tool offers the CRC16 variant Redis uses (XMODEM), since other hashes will not match the cluster's slot mapping.
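For an offline check without any external tool, the slot calculation is short enough to write out. This sketch follows the Redis Cluster specification's rule (CRC16, XMODEM variant, modulo 16384, with hash-tag handling); the function names are my own, and CLUSTER KEYSLOT on a live node remains the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021, initial value 0, no bit reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honoring a non-empty {hash tag} if present."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


if __name__ == "__main__":
    for name in ("mykey", "anotherkey", "user:{42}:profile"):
        print(name, key_slot(name))
```

Keys that share a hash tag, such as user:{42}:profile and user:{42}:orders, land in the same slot, which is what makes multi-key operations possible in a cluster.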
