Fixing Kubernetes 'dial tcp: i/o timeout' caused by Network Policy

intermediate | ☸️ Kubernetes | 2026-05-23 | Kubernetes 1.24+ with a CNI plugin that supports NetworkPolicy (Calico, Cilium, Weave, or Azure CNI)

Error Message

dial tcp <pod-ip>:<port>: i/o timeout
#kubernetes #network-policy #networking #timeout #pod-to-pod

The Silent Connection Killer

Three hours. That's how long I spent debugging a microservice that couldn't reach its database. The logs were flooded with one specific error, but nothing on the application side seemed wrong. Pods were running. The service was reachable. DNS resolved fine.

dial tcp 10.244.1.45:5432: i/o timeout

In Kubernetes, an i/o timeout means the packet left but never got a reply, or it was dropped silently mid-route. That's the key difference from connection refused, which means the host was reachable but nothing was listening on the port, so the target actively said "no" with a TCP reset. A timeout usually points to a firewall or a NetworkPolicy dropping packets without sending a reset back.
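
To make the distinction concrete, here is a minimal sketch of a probe that separates the two failure modes by exit code. It assumes bash (for the /dev/tcp redirection) and GNU coreutils timeout are available in the debug container; the function name probe is hypothetical.

```shell
#!/usr/bin/env bash
# probe HOST PORT [SECONDS]: classify one TCP connection attempt.
# Assumes bash's /dev/tcp redirection and GNU coreutils `timeout`.
probe() {
  local host="$1" port="$2" secs="${3:-3}"
  # Pass host and port as arguments so they are not re-parsed as shell code.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
  case $? in
    0)   echo "open" ;;
    124) echo "timeout" ;;      # no answer within the limit: packets are likely being dropped
    *)   echo "failed-fast" ;;  # immediate error, e.g. connection refused (TCP reset)
  esac
}

# Example: probe 10.244.1.45 5432
```

A "timeout" result is the signature of a silent drop; "failed-fast" means something answered right away, which usually rules out a NetworkPolicy as the cause.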

How I Debugged the Timeout

For pod-to-pod timeouts, I run through a short checklist to narrow down the cause. Start simple, then work inward.

1. Verify the Pod is Actually Listening

The first check: is the target pod alive and listening on that port at all? I spun up a netshoot debug pod in the same namespace and hit the IP directly: no DNS, no service layer in between.

kubectl run tmp-shell -n production --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside the pod:
nc -zv 10.244.1.45 5432

Another timeout. Same problem. That ruled out application code entirely: this was a network-layer issue.

2. Check for Existing Network Policies

Next, I checked for any policies applied to the namespace.

kubectl get networkpolicy -n production

There it was: a policy named default-deny-ingress. And that's the gotcha many developers miss: the moment a pod is selected by any NetworkPolicy with the Ingress policy type, all inbound traffic that no policy explicitly allows is dropped (the same holds for outbound traffic and the Egress type). Someone had locked down the namespace for security but never added an exception for my new service.
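
A quick way to confirm what a policy actually covers is to dump it and read its podSelector and policyTypes. The commented spec below is what a namespace-wide ingress deny commonly looks like; it is an assumption about default-deny-ingress, not its confirmed contents.

```shell
# Dump the policy and check which pods it selects and in which direction.
kubectl get networkpolicy default-deny-ingress -n production -o yaml

# A typical namespace-wide ingress deny has a spec shaped like this:
#   spec:
#     podSelector: {}    # empty selector: every pod in the namespace
#     policyTypes:
#     - Ingress          # no ingress rules follow, so nothing is allowed in

# Human-readable summary of the same policy.
kubectl describe networkpolicy default-deny-ingress -n production
```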

The Fix: Write a Specific Allow Rule

Don't delete the deny policy. That's a security regression. The right move is to write a narrow allow rule that opens only the exact traffic you need.

The "Allow Ingress" Policy

I wrote a NetworkPolicy targeting the database pod (the receiver) to permit traffic from the API pod on port 5432.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-db-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web-api
    ports:
    - protocol: TCP
      port: 5432
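
Policies only work if the selectors match real labels, so it is worth checking both sides after applying. A short sketch, assuming the manifest above was saved as allow-db-access.yaml (a hypothetical file name) and that the pods really carry the app=postgres and app=web-api labels:

```shell
# Apply the allow rule (file name is an assumption).
kubectl apply -f allow-db-access.yaml

# The policy's podSelector must match the database pod...
kubectl get pods -n production -l app=postgres --show-labels

# ...and the "from" selector must match the client pods.
kubectl get pods -n production -l app=web-api --show-labels

# Confirm the rule was stored as intended.
kubectl describe networkpolicy allow-db-access -n production
```

An empty result from either kubectl get pods call means the selector matches nothing, so the policy has no effect on those pods.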

Don't Forget Egress

Timeouts can happen on the sending side too. If your source pod has an Egress policy, it might be blocked from initiating the connection in the first place. My API pod did: its egress rules were just as restrictive. I added this snippet:

# Partial snippet for the API's Egress policy
egress:
- to:
  - podSelector:
      matchLabels:
        app: postgres
  ports:
  - protocol: TCP
    port: 5432
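
For reference, a fragment like this sits inside a full policy roughly like the sketch below. The policy name allow-api-egress-to-db is hypothetical, and the DNS rule is an assumption: once a pod is isolated for egress, name lookups need their own allow rule, and many clusters serve DNS on port 53 over UDP and TCP.

```shell
# Apply a complete egress policy for the API pods (names are illustrative).
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-egress-to-db   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-api               # the sender
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - ports:                       # assumed DNS rule: port 53 to any destination
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
```

Because NetworkPolicies are additive, a separate policy like this can coexist with an existing restrictive one; the allowed traffic is the union of all policies that select the pod.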

Verifying the Fix

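One more test from a netshoot pod. The test pod has to carry the app=web-api label (kubectl run accepts a --labels flag for this); otherwise the new allow rule doesn't select it and the connection still times out: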

$ nc -zv 10.244.1.45 5432
Connection to 10.244.1.45 5432 port [tcp/postgresql] succeeded!

Instant success. No more timeout.

What I Changed Going Forward

Kubernetes networking is invisible until it breaks. After this incident, I made three habits stick:

  • Standardize pod labels: Every pod gets app, role, and env labels from day one. NetworkPolicies can only select what they can find.
  • Document the deny policy: A default deny is the right call for production. But you need a clear onboarding checklist so new services don't silently time out on their first deploy.
  • Verify CIDR ranges before committing: When policies involve external IPs or specific subnets, I double-check ranges with the Subnet Calculator on ToolCraft. Mistyping a CIDR block (writing /24 when you meant /32, for example) can accidentally allow 255 extra IPs or block an entire segment.
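
The CIDR arithmetic is easy to sanity-check from a shell before committing a policy. A minimal sketch for IPv4 prefixes; the helper name cidr_size is hypothetical:

```shell
#!/usr/bin/env bash
# cidr_size PREFIX: number of IPv4 addresses matched by a /PREFIX block.
cidr_size() {
  local prefix="$1"
  echo $(( 1 << (32 - prefix) ))
}

# Compare a few common prefix lengths.
for p in 32 28 24 16; do
  printf '/%s matches %s address(es)\n' "$p" "$(cidr_size "$p")"
done
```

A /32 matches exactly one address while a /24 matches 256, which is why a single mistyped digit in an ipBlock rule changes the blast radius so much.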

Still seeing dial tcp: i/o timeout after all this? Dig into the CNI logs on the worker nodes (Calico and Cilium both have per-node logs). Occasionally the culprit is a stale iptables rule or a routing table mismatch at the node level. It's rare, but it happens in long-running clusters after partial upgrades.
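
As a starting point for that node-level digging, the commands below pull recent logs from the CNI agents. The namespaces and label selectors are assumptions based on common default installs (k8s-app=calico-node and k8s-app=cilium in kube-system); operator-based Calico installs often use the calico-system namespace instead, so adjust to your cluster.

```shell
# Find which node hosts the target pod (pod IP taken from the error above).
kubectl get pods -n production -o wide | grep 10.244.1.45

# Calico: recent logs from the per-node agents (selector/namespace assumed).
kubectl logs -n kube-system -l k8s-app=calico-node --tail=100

# Cilium: same idea for the Cilium agents (selector/namespace assumed).
kubectl logs -n kube-system -l k8s-app=cilium --tail=100
```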
