Fix 'Neighbour table overflow: ARP table full' Causing Network Drops in Large Networks

The 2 AM Wake-Up Call

Monitoring fires off. Users can't reach the app. SSH to half your servers times out. You check the obvious stuff — interfaces are up, routes look fine — then you spot it buried in dmesg:

[ 4823.119472] neighbour: arp_cache: neighbor table overflow!
[ 4823.119512] Neighbour table overflow: ARP table full

Your kernel ran out of space in its ARP table. New ARP entries can't be created, so packets to hosts the server hasn't seen recently get dropped silently. In a flat /20 or larger network with hundreds of active hosts, this happens fast — especially on gateways, load balancers, or monitoring boxes that talk to everything.

What's Actually Happening Under the Hood

The Linux kernel keeps a neighbor table (the ARP cache) to map IP addresses to MAC addresses. By default, it's sized for small networks. Three kernel parameters control the limits:

gc_thresh1 — below this, no garbage collection runs (default: 128)
gc_thresh2 — soft limit; GC kicks in after 5 seconds if exceeded (default: 512)
gc_thresh3 — hard limit; no new entries allowed past this point (default: 1024)

On a busy server sitting in a /22 network (~1000 active hosts), you hit gc_thresh3 in minutes. The kernel starts silently dropping packets to any host it can't ARP for, which is why everything looks fine on paper but nothing actually works.
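To make the three thresholds concrete, here is a minimal sketch that maps an entry count onto the regime described above. The function name neigh_zone and the zone labels are made up for illustration, and the boundary handling is a simplification of what the kernel does.

```shell
# neigh_zone COUNT THRESH1 THRESH2 THRESH3
# Prints the garbage-collection regime for a neighbor table holding COUNT
# entries. Hypothetical helper; the labels are illustrative, not kernel terms.
neigh_zone() {
    count=$1 t1=$2 t2=$3 t3=$4
    if [ "$count" -ge "$t3" ]; then
        echo "hard-limit"    # new entries are refused and the overflow is logged
    elif [ "$count" -ge "$t2" ]; then
        echo "soft-limit"    # excess entries are tolerated for about 5 seconds
    elif [ "$count" -ge "$t1" ]; then
        echo "gc-eligible"   # periodic garbage collection may prune entries
    else
        echo "no-gc"         # below gc_thresh1 the collector leaves the table alone
    fi
}
```

With the stock defaults, neigh_zone 600 128 512 1024 reports soft-limit, which is where a busy host in a /22 tends to sit shortly before it overflows.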

Confirm It Before You Change Anything

Run these commands to verify the ARP table is full and causing the drops:

# Check current ARP table size
ip neigh show | wc -l

# Check current kernel limits
cat /proc/sys/net/ipv4/neigh/default/gc_thresh1
cat /proc/sys/net/ipv4/neigh/default/gc_thresh2
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3

# Look for the overflow message in recent kernel logs
# (kernels spell it both "neighbour" and "neighbor", so match either)
dmesg | grep -iE 'neighbou?r|arp' | tail -20
journalctl -k | grep -iE 'neighbou?r table overflow' | tail -10

# Check ARP cache counters (hex values, one row per CPU)
# On recent kernels, a rising table_fulls column means entries were refused
cat /proc/net/stat/arp_cache

If ip neigh show | wc -l returns a number near or above gc_thresh3, that's your culprit. You'll likely also see entries stuck in FAILED state:

ip neigh show | grep FAILED | head -20
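A quick breakdown by state shows whether the table is full of live neighbors or mostly dead weight. This is a small sketch; count_by_state is a hypothetical helper name, and it assumes the state is the last whitespace-separated field on each line of ip neigh show output.

```shell
# count_by_state: reads `ip neigh show` output on stdin and prints
# "STATE COUNT" lines sorted by state name.
# Assumption: the neighbor state is the last field on every line.
count_by_state() {
    awk 'NF { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Run it as ip -4 neigh show | count_by_state. A large FAILED or INCOMPLETE count usually points at something probing addresses that don't answer.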

Immediate Fix (No Reboot Required)

Raise the thresholds now to stop the packet drops:

sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=16384

Also seeing ndisc_cache overflows? Cover IPv6 too:

sudo sysctl -w net.ipv6.neigh.default.gc_thresh1=4096
sudo sysctl -w net.ipv6.neigh.default.gc_thresh2=8192
sudo sysctl -w net.ipv6.neigh.default.gc_thresh3=16384

Target roughly 2–3× your expected unique host count per segment. A /20 network has 4094 usable addresses — set gc_thresh3 to at least 8192. A /16 with thousands of active VMs warrants 32768 or higher.
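The sizing rule is easy to script. The sketch below uses the low end of that range (2×) and rounds gc_thresh3 up to a power of two, keeping the 1:2:4 ratio used in the commands above. The rounding and the function names usable_hosts and suggest_thresholds are assumptions for illustration, not kernel requirements.

```shell
# usable_hosts PREFIX: usable IPv4 addresses in a /PREFIX network (/1 to /30).
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

# suggest_thresholds HOSTS: prints "gc_thresh1 gc_thresh2 gc_thresh3".
# gc_thresh3 is the smallest power of two >= 2 x HOSTS, never below the
# kernel default of 1024. Power-of-two rounding is an assumption of this
# sketch, chosen only to get tidy numbers.
suggest_thresholds() {
    target=$(( $1 * 2 ))
    t3=1024
    while [ "$t3" -lt "$target" ]; do
        t3=$(( t3 * 2 ))
    done
    echo "$(( t3 / 4 )) $(( t3 / 2 )) $t3"
}
```

For a fully populated /20, suggest_thresholds "$(usable_hosts 20)" prints 2048 4096 8192. Feed it your expected active host count instead of the address-space size when the segment is sparsely used.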

Make It Survive a Reboot

Those sysctl -w changes vanish on reboot. Write them to a persistent config file:

sudo tee /etc/sysctl.d/99-arp-table.conf <<EOF
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384
net.ipv6.neigh.default.gc_thresh1 = 4096
net.ipv6.neigh.default.gc_thresh2 = 8192
net.ipv6.neigh.default.gc_thresh3 = 16384
EOF

sudo sysctl --system
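If you want to confirm the file and the running kernel agree (for example after a reboot, or when another file in /etc/sysctl.d overrides yours), a small drift check helps. This is a sketch under a few assumptions: sysctl_drift is a hypothetical helper, it handles only single-value keys, and it maps dots in the key to path separators under /proc/sys.

```shell
# sysctl_drift CONF [PROC_ROOT]
# Compares each "key = value" line in CONF with the live value under
# PROC_ROOT (default /proc/sys). Prints one line per key and returns
# non-zero if any value differs.
# Assumptions: single-value keys only, no dots inside interface names.
sysctl_drift() {
    conf=$1
    root=${2:-/proc/sys}
    rc=0
    while IFS='=' read -r key value; do
        key=$(printf '%s' "$key" | tr -d ' \t')
        value=$(printf '%s' "$value" | tr -d ' \t')
        case $key in ''|'#'*|';'*) continue ;; esac   # skip blanks and comments
        path=$root/$(printf '%s' "$key" | tr . /)
        live=$(cat "$path" 2>/dev/null)
        if [ "$live" = "$value" ]; then
            echo "OK $key=$value"
        else
            echo "DRIFT $key: file=$value live=${live:-unreadable}"
            rc=1
        fi
    done < "$conf"
    return $rc
}
```

Calling sysctl_drift /etc/sysctl.d/99-arp-table.conf should print six OK lines once the settings are applied.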

Verify the Fix Worked

# Confirm new limits are active
sysctl net.ipv4.neigh.default.gc_thresh3

# ARP table should now be well below the limit
ip neigh show | wc -l

# No new overflow messages in the kernel log (older ones stay until the buffer rotates)
dmesg | grep -iE 'neighbou?r table overflow'

# FAILED entries clear on their own, but you can flush them immediately (needs root)
sudo ip neigh flush nud failed
sudo ip neigh flush nud stale

Connectivity typically restores within a few seconds of applying the sysctl changes. If hosts are still unreachable, flush the failed ARP entries manually — the kernel will re-ARP and populate fresh ones right away.

Optional: Tune Garbage Collection for Dynamic Environments

Running lots of VMs or containers that spin up and down constantly? The default GC settings were designed for static hosts. Tighten them up:

# How long an entry counts as reachable before it goes stale
# Kernel default is 30000 ms (30 s); a lower value such as 15000 ages
# entries out sooner when IPs recycle frequently
sudo sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000

# GC runs every N seconds (default: 30, fine to leave as-is)
sudo sysctl -w net.ipv4.neigh.default.gc_interval=30

# How long a stale entry lingers before GC may remove it
# Kernel default is 60 s; 30 is a starting point to test, not a universal value
sudo sysctl -w net.ipv4.neigh.default.gc_stale_time=30

The Root Cause You Should Actually Fix

Raising the threshold buys time. But ask yourself why you have that many ARP entries in the first place:

Flat network too large — a /20 stretched across a single Layer 2 domain is a design problem. Break it into smaller VLANs with a routed core. Each segment stays in its own broadcast and ARP domain, so no single server needs to know about 4000 hosts.
ARP scan or monitoring tool — Nagios, Zabbix, or a custom network scanner hitting every IP in a /20 range will flood the ARP table in minutes. Check what's running broad pings across the segment.
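Broadcast storm — run sudo tcpdump -l -n -i eth0 arp | pv -l -r > /dev/null to see the live ARP packet rate (swap eth0 for your interface; -l keeps tcpdump line-buffered so pv updates in real time). Hundreds per second means you have a broadcast amplification problem, not just a table size problem.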
Kubernetes or container overlay — large clusters exhaust ARP tables fast because each pod gets its own IP. Apply the same sysctl fix, but also check whether your CNI (Calico, Flannel, Cilium) has its own neighbor table configuration that needs tuning.
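To find out which host is generating the ARP traffic, you can tally requests by sender. This is a rough sketch: arp_top_talkers is a hypothetical helper, and it assumes the usual tcpdump -n text format where each request reads "who-has TARGET tell SENDER, length N".

```shell
# arp_top_talkers: reads `tcpdump -n arp` text on stdin and prints
# "COUNT SENDER" for each host issuing ARP requests, busiest first.
# Assumption: request lines contain "who-has <target> tell <sender>,".
arp_top_talkers() {
    awk '/who-has/ {
             for (i = 1; i < NF; i++)
                 if ($i == "tell") {
                     s = $(i + 1)
                     sub(/,$/, "", s)   # drop the trailing comma after the sender
                     n[s]++
                     break
                 }
         }
         END { for (s in n) print n[s], s }' | sort -k1,1nr -k2,2
}
```

Capture a fixed sample and pipe it in, for example sudo tcpdump -n -c 1000 -i eth0 arp 2>/dev/null | arp_top_talkers. One address dominating the list is usually your scanner or monitoring box.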

Network Planning Tip

When you're splitting a large flat network into smaller VLANs to reduce ARP pressure, getting the subnet math right matters. I use the Subnet Calculator on ToolCraft to work out CIDR ranges, usable host counts, and broadcast addresses when planning VLAN splits. It runs entirely in the browser, so nothing leaves your machine, which matters when you're working with internal IP schemes.

What to Take Away From This

The default 1024-entry ARP limit made sense decades ago. Any modern flat network larger than a /22 needs these values tuned at provisioning time — not discovered at 2 AM during an outage.
Add ARP table saturation to your monitoring. A single metric — ip neigh show | wc -l divided by gc_thresh3 — gives you an early warning before it hits production.
Drop these sysctl values into your Ansible playbook or cloud-init config. The second time you chase this at 2 AM is entirely avoidable.
If the overflow keeps recurring after you've raised the threshold, you're solving the wrong problem. That's a network architecture issue, not a Linux tuning issue.
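The saturation metric from the takeaways can be sketched as a tiny check to wire into whatever monitoring you already run. The function names and the 80% default warning level are assumptions for illustration; pick a level that matches how fast your table grows.

```shell
# neigh_saturation COUNT LIMIT: integer percentage of the table in use.
neigh_saturation() {
    echo $(( $1 * 100 / $2 ))
}

# neigh_alert COUNT LIMIT [WARN_PCT]
# Prints a status line and returns 1 once usage reaches WARN_PCT.
# The default of 80 is an arbitrary illustrative choice.
neigh_alert() {
    pct=$(neigh_saturation "$1" "$2")
    if [ "$pct" -ge "${3:-80}" ]; then
        echo "WARN neighbor table at ${pct}% ($1/$2)"
        return 1
    fi
    echo "OK neighbor table at ${pct}% ($1/$2)"
}
```

On a live host, call it as neigh_alert "$(ip -4 neigh show | wc -l)" "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)" and alert on a non-zero exit status.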