
Docker swarm leave --force - context deadline exceeded

I'm following the Docker tutorials here https://docs.docker.com/get-started/part3/

When I execute the command docker swarm leave --force near the end of the tutorial, I keep getting: Error response from daemon: context deadline exceeded

Every subsequent time I run docker swarm leave --force, the terminal just hangs; it no longer prints the error message, but it doesn't return to the prompt unless I press CTRL+C.

The docker swarm init command at the beginning of the linked tutorial is also unresponsive when the daemon is in this state.

The only way the docker swarm commands work again is if I shut down my VM instance and restart it. But when I follow the steps from the link again, I get the same error on docker swarm leave --force.

Any ideas why it's doing this?

I'm running Ubuntu 18.04.1 LTS in VirtualBox, with Docker version 18.09.0-rc1, build 6e632f7.

I saw this other question, Cannot leave swarm mode, about the same issue, but it is two years old and the answers there are either workarounds or amount to removing Docker completely and reinstalling. I'm hoping there is another way to fix this.

What works for me with failing managers is not restarting the whole node, but stopping the Docker service, removing the /var/lib/docker/swarm directory, restarting the Docker service, and then re-adding the manager:

On manager-failing (the failing manager):

sudo systemctl stop docker
sudo rm -r /var/lib/docker/swarm
sudo systemctl start docker
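After restarting the service, a quick way to confirm the node really left swarm mode is docker info's Go-template output (a sketch; run it on the node you just cleaned):

```shell
# Should print "inactive" once the swarm state directory has been wiped;
# a stuck node typically reports "pending" or "error" instead.
docker info --format '{{.Swarm.LocalNodeState}}'
```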

On manager-working (other, functioning manager):

docker node demote manager-failing
docker node rm manager-failing
ssh manager-failing $(docker swarm join-token manager | tail -2)

Well, I have some good and bad news for you.

I faced the same issue in 2016-2017 while building a large experimental Docker swarm environment: a multi-region, 50+ node swarm cluster with DNS load balancing.
At one point our Ceph storage cluster crashed and took a lot of the swarm nodes down with it. When all the nodes came back online, I experienced the same issues you describe.

The good news:
What worked for me was stopping the Docker service, rebooting, and restarting Docker. All the services running on the cluster magically reappeared as if nothing had happened.

The bad news:
This worked on most of the nodes. Some swarm managers never recovered; those nodes I simply destroyed and spun up new ones to add to the swarm.

EDIT: I have dug out some old scripts that I used for swarm recovery.

To restore a failed swarm manager, first make a backup of the swarm configuration and spin up a new instance.

 mkdir /root/Backup
 cp -rf /var/lib/docker/swarm /root/Backup
 cd /root/Backup
 tar -czvf swarm.tar.gz swarm/
 scp swarm.tar.gz user@new_host:/tmp

On the new host, restore the config:

cp /tmp/swarm.tar.gz /var/lib/docker
cd /var/lib/docker
tar -xzvf swarm.tar.gz
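Restoring the files alone is usually not enough: the restored node still thinks it belongs to the old manager set. Docker's documented disaster-recovery step is to re-initialize from the restored state (the --force-new-cluster flag is part of the stock CLI; the IP below is a placeholder for your new host's address):

```shell
# Rebuild a single-manager swarm from the restored raft state.
# --force-new-cluster keeps services and configs but discards the
# old manager membership list, so other managers must be re-joined.
docker swarm init --force-new-cluster --advertise-addr <NEW_HOST_IP>
```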

Drain your worker nodes

docker node update --availability drain [node]

Update all your running services

docker service update --force [service]
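Since docker service update expects a service name, forcing a refresh of every service in the cluster can be done with a loop over docker service ls (a sketch; the service IDs come from whatever your stack defined):

```shell
# Force a rolling restart of every service so its tasks are
# rescheduled onto the recovered topology.
for svc in $(docker service ls -q); do
  docker service update --force "$svc"
done
```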
