Fix Ansible 'SSH Error: data could not be sent to remote host' When SSH Connection Drops Mid-Run

The Situation

Your playbook starts fine — host reachability passes, tasks begin executing — then somewhere in the middle, everything dies with:

fatal: [webserver-01]: FAILED! => {"msg": "SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh"}

The host is up. You can SSH into it manually right now. But Ansible can't finish what it started. That's the infuriating part — this isn't an unreachable host error. The connection dropped during execution.

Why This Happens

Ansible multiplexes SSH connections through ControlMaster — one master socket shared across multiple task operations. When a long-running task stalls the SSH pipeline (idle timeout, firewall RST packets, buffer overflow), the underlying socket dies while Ansible still thinks it's alive. The next write to that dead socket triggers this error.

Common triggers:

A firewall or NAT device killing idle TCP connections: common with AWS NAT gateways and load balancers, GCP VPC firewall connection tracking, or corporate proxies, which typically drop connections after roughly 350–600 seconds of inactivity
Tasks that run longer than the SSH server's ClientAliveInterval without any keepalive traffic
Large data transfers overwhelming the SSH pipe buffer
DNS resolution failures for the remote host mid-session
The remote host's sshd hitting resource limits and dropping connections
Ansible's ControlMaster socket going stale between tasks in a long playbook

Quick Fix — Get the Playbook Running Right Now

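Disable ControlMaster and add keepalives. Use ansible_ssh_args rather than ansible_ssh_extra_args for this: ssh keeps the first value it sees for an option, and extra args are appended after Ansible's default -o ControlMaster=auto, so a ControlMaster=no placed there is ignored. Add this to your playbook or inventory:

# In your playbook
- hosts: all
  vars:
    ansible_ssh_args: '-o ControlMaster=no -o ServerAliveInterval=30 -o ServerAliveCountMax=10'

Or set it per host in your inventory:

[webservers]
webserver-01 ansible_ssh_args="-o ControlMaster=no -o ServerAliveInterval=30 -o ServerAliveCountMax=10"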

Then rerun from the failure point instead of starting over:

ansible-playbook site.yml --start-at-task="The task that failed" -v

ServerAliveInterval=30 sends a keepalive every 30 seconds. Most firewalls that kill idle connections wait at least 60 seconds, so this keeps the socket warm. ControlMaster=no makes Ansible open a fresh SSH connection for each operation instead of reusing a master socket, which bypasses any stale socket lingering from a previous crash. It is slower, so treat it as a stopgap rather than a permanent setting.

Diagnosing the Root Cause

Before locking in a permanent fix, identify which trigger you're dealing with.

Check if it's a keepalive / firewall issue

Run the playbook at -vvvv, which turns on connection debugging so the ssh client's own debug lines appear in the output:

ansible-playbook site.yml -vvvv 2>&1 | grep -iE "(debug|channel|packet|timeout|timed out|broken pipe)"

channel X: open failed or Broken pipe in the output means the pipe died. Connection timed out points to a network or firewall problem upstream.
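
To make that triage repeatable, the matching can be wrapped in a small filter. This is a minimal sketch: the function name classify_ssh_output and the labels it prints are made up for illustration, and the patterns cover only the messages named above.

```shell
# classify_ssh_output: read ssh/Ansible debug output on stdin and print one
# label per failure family seen. The name and labels are illustrative; they
# are not part of Ansible or OpenSSH.
classify_ssh_output() {
    awk '
        /[Bb]roken pipe/ || /channel [0-9]+: open failed/ { pipe = 1 }
        /[Cc]onnection timed out/                         { net = 1 }
        END {
            if (pipe) print "pipe-died"
            if (net)  print "network-timeout"
            if (!pipe && !net) print "no-known-pattern"
        }
    '
}

# Usage (assumes the debug run was captured to debug.log):
#   classify_ssh_output < debug.log
```

A no-known-pattern result does not mean the run was clean, only that none of these specific messages showed up.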

Test the SSH keepalive behavior manually

ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=10 user@host "sleep 300 && echo done"

This command holds an SSH session open for 5 minutes doing nothing useful — exactly what Ansible does during long tasks. Complete successfully here but fail in your playbook? The SSH config fix below is your answer. Drops here too? You've got a deeper network problem to chase.

Check sshd logs on the remote host

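sudo journalctl -u sshd --since "30 minutes ago" | grep -iE "(disconnect|timeout|error)"
# the unit is named ssh on Debian/Ubuntu: journalctl -u ssh
# or on older systems (/var/log/secure on RHEL-family hosts):
grep -i "disconnect\|timeout\|error" /var/log/auth.log | tail -50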

Timeout, client not responding appearing from the server side means the disconnection was server-initiated: sshd's ClientAlive probes went unanswered and it gave up.
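
If the log is long, counting those lines gives a quick read on how often the server is the one hanging up. A small sketch, assuming the log was first saved to a file; count_sshd_timeouts is a hypothetical helper name, and other sshd versions may word the message differently.

```shell
# count_sshd_timeouts FILE
# Prints how many lines in FILE look like sshd giving up on a silent client.
# Matches the message text case-insensitively.
count_sshd_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the helper usable in scripts that run with "set -e".
    grep -ci "timeout, client not responding" "$1" || true
}

# Usage:
#   sudo journalctl -u sshd --since "30 minutes ago" > /tmp/sshd.log
#   count_sshd_timeouts /tmp/sshd.log
```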

Permanent Fix — ansible.cfg

Set these globally so every playbook benefits without repeating ssh args everywhere:

[defaults]
timeout = 30

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o ServerAliveCountMax=10
pipelining = True
control_path_dir = /tmp/.ansible/cp

Key settings explained:

ServerAliveInterval=30: client sends a keepalive every 30 seconds
ServerAliveCountMax=10: tolerate up to 10 missed keepalives (5 minutes total) before declaring the connection dead
ControlPersist=60s: master connection stays alive 60 seconds after last use, not indefinitely; cuts stale socket risk significantly
pipelining = True: runs modules by piping them to the remote interpreter over the open SSH session instead of copying files over first, so each task needs fewer SSH operations and has fewer chances to hit a dead connection
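
The arithmetic behind those two keepalive settings is worth checking against whatever idle timeout sits in your network path. A minimal sketch; both function names are hypothetical helpers, and the 350-second figure in the usage lines is just the low end of the range mentioned earlier.

```shell
# keepalive_window INTERVAL COUNT_MAX
# Seconds of unanswered keepalives before ssh declares the connection dead.
keepalive_window() {
    echo $(( $1 * $2 ))
}

# keepalive_beats_idle_timeout INTERVAL IDLE_TIMEOUT
# Succeeds when a keepalive is sent before a middlebox with the given idle
# timeout (in seconds) would treat the connection as idle. An interval of 0
# means keepalives are off, so it never succeeds.
keepalive_beats_idle_timeout() {
    [ "$1" -gt 0 ] && [ "$1" -lt "$2" ]
}

keepalive_window 30 10    # prints 300

# Usage:
#   keepalive_beats_idle_timeout 30 350 && echo "interval is short enough"
```

The interval is what keeps a firewall from seeing the connection as idle; the count only controls how long ssh waits before giving up on a peer that has gone silent.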

If you're hitting firewall RST packets (AWS/GCP/Azure)

Cloud provider firewalls typically reset connections idle for more than 350–600 seconds. Configure sshd on the remote host to send keepalives from its side too:

# /etc/ssh/sshd_config
ClientAliveInterval 30
ClientAliveCountMax 10
TCPKeepAlive yes

# Validate the config first, then reload (the unit is named ssh on Debian/Ubuntu)
sudo sshd -t && sudo systemctl reload sshd

Relying only on client-side keepalives puts all the responsibility on Ansible. When both sides send them, either end can detect a dead connection and neither has to wait for a full timeout.
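
To see what a given sshd_config actually adds up to, the two ClientAlive values can be read back and multiplied. This is a rough sketch, not a full parser: client_alive_window is a hypothetical helper, it ignores Match blocks and Include files, and it assumes the OpenSSH defaults (interval 0, count 3) when a keyword is absent.

```shell
# client_alive_window FILE
# Prints ClientAliveInterval * ClientAliveCountMax for an sshd_config-style
# file. sshd keeps the first value it reads for a keyword, so later
# duplicates are ignored here too. Commented-out lines never match because
# their first field starts with "#".
client_alive_window() {
    awk '
        BEGIN { interval = 0; count = 3 }
        tolower($1) == "clientaliveinterval" && !seen_i { interval = $2; seen_i = 1 }
        tolower($1) == "clientalivecountmax" && !seen_c { count = $2; seen_c = 1 }
        END { print interval * count }
    ' "$1"
}

# Usage:
#   client_alive_window /etc/ssh/sshd_config
```

A result of 0 means server-side keepalives are off. For the values sshd has really loaded, sudo sshd -T | grep -i clientalive prints the effective configuration.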

If pipelining causes issues with sudo

Some sudo configurations require a TTY, which conflicts with pipelining. Getting sudo: sorry, you must have a tty to run sudo after enabling it? Fix it on the remote host:

# Add to /etc/sudoers via visudo:
Defaults !requiretty

Clearing Stale ControlMaster Sockets

Crashed playbooks leave dead sockets behind. Those stale files trigger the same error on your next run even after the network is fine:

ls /tmp/.ansible/cp/
rm -f /tmp/.ansible/cp/*

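Or surgically shut down a specific master rather than nuking the whole directory. By default Ansible names each socket with a short hash rather than user@host:port, so point ControlPath at the actual file from the ls listing:

ansible all -m ping  # Still failing? Sockets are stale.
ssh -O exit -o ControlPath=/tmp/.ansible/cp/<socket-file> user@host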
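
To find out which control sockets are actually dead before deleting anything, each one can be probed with ssh -O check. A minimal sketch, assuming the sockets live in the control_path_dir configured earlier; list_stale_sockets is a hypothetical helper name.

```shell
# list_stale_sockets DIR
# Prints every control socket in DIR whose master no longer answers.
# "ssh -O check" asks the master behind a socket whether it is alive. ssh
# requires a host argument, but the explicit ControlPath decides which
# master is queried, so a placeholder name is passed.
list_stale_sockets() {
    for sock in "$1"/*; do
        [ -S "$sock" ] || continue    # skip non-sockets and empty directories
        if ! ssh -O check -o ControlPath="$sock" placeholder >/dev/null 2>&1; then
            echo "$sock"
        fi
    done
}

# Usage:
#   list_stale_sockets /tmp/.ansible/cp
```

Anything it prints is safe to remove with rm -f; sockets it does not print still have a live master behind them.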

Verifying the Fix

After applying the ansible.cfg changes, test with a deliberately slow task before trusting it with production:

# Use the shell module here: the command module does not interpret &&
ansible webservers -m shell -a "sleep 60 && echo 'connection survived'"

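connection survived after 60 seconds means the connection held through a quiet minute. To prove the keepalives beat your firewall, raise the sleep above the idle timeout that was killing you (for example, 400 seconds against a 350-second limit). Then run the full playbook with logging: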

ansible-playbook site.yml -v 2>&1 | tee playbook-run.log

Scan the output for SSH Error. No hits means the fix held.
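
On a large inventory it helps to know which hosts still hit the error, not just whether it appeared. A small sketch that pulls host names out of the saved log, assuming failure lines keep the fatal: [host]: shape shown at the top of this article; hosts_with_ssh_error is a hypothetical helper name.

```shell
# hosts_with_ssh_error LOGFILE
# Prints the unique hosts whose "fatal:" lines mention "SSH Error".
hosts_with_ssh_error() {
    grep "SSH Error" "$1" \
        | sed -n 's/^fatal: \[\([^]]*\)].*/\1/p' \
        | sort -u
}

# Usage:
#   hosts_with_ssh_error playbook-run.log
```

Empty output means no matching failure lines were found in that log.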

Tips

When this error shows up inconsistently across a large fleet — some hosts fail, others don't — the problem is usually subnet-level. Different hosts travel different network paths through different firewall rules. Before chasing per-host SSH configs, map out your network topology first. I reach for the Subnet Calculator at ToolCraft to quickly figure out which CIDR ranges different hosts fall into and whether they'd hit different firewall policies. It's browser-only, no data sent anywhere.

For long-running tasks like database migrations or package installs, consider async and poll instead of holding the SSH connection open the whole time:

- name: Run long database migration
  command: python manage.py migrate
  async: 600    # Allow up to 10 minutes
  poll: 15      # Check every 15 seconds

Ansible fires the task, disconnects, then polls for completion on a schedule. The SSH pipe never stays open long enough to drop — which sidesteps this entire problem for operations that routinely run more than a minute or two.
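
When picking async and poll values, it can help to see how many status checks they imply in the worst case. A tiny sketch of that arithmetic; max_status_checks is a hypothetical helper name.

```shell
# max_status_checks ASYNC_SECONDS POLL_SECONDS
# Upper bound on how many times the task is polled before the async limit
# is reached (integer division, rounded up).
max_status_checks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

max_status_checks 600 15    # prints 40
```

With the values above that is at most 40 short check-ins instead of one connection held open for up to ten minutes. The same limits are available for ad-hoc runs through the ansible command's -B (background seconds) and -P (poll interval) options.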