
Kubernetes service stops after one node is down

I am setting up a small Kubernetes cluster using a VM (master) and 3 bare-metal servers (all running Ubuntu 14.04). I followed the Kubernetes install tutorial for Ubuntu. Each bare-metal server also has 2 TB of disk space exported using Ceph 0.94.5. Everything was working fine, but when one node failed to start (it wasn't able to mount a partition), the only service the cluster was providing also stopped working. I ran some commands:

$ kubectl get nodes
NAME        LABELS                             STATUS
10.70.2.1   kubernetes.io/hostname=10.70.2.1   Ready,SchedulingDisabled
10.70.2.2   kubernetes.io/hostname=10.70.2.2   Ready
10.70.2.3   kubernetes.io/hostname=10.70.2.3   NotReady
10.70.2.4   kubernetes.io/hostname=10.70.2.4   Ready

It just showed that I had a node down.

$ kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
java-mysql-5v7iu   1/1       Running   1          5d
java-site-vboaq    1/1       Running   0          4d

$ kubectl get services
NAME         LABELS                                    SELECTOR          IP(S)          PORT(S)
java-mysql   name=java-mysql                           name=java-mysql   ***.***.3.12   3306/TCP
java-site    name=java-site                            name=java-site    ***.***.3.11   80/TCP
kubernetes   component=apiserver,provider=kubernetes   <none>            ***.***.3.1    443/TCP

It showed all pods and services as working fine. However, I could not connect to one of the pods (java-site-vboaq):

$ kubectl exec java-site-vboaq -i -t -- bash
error: Error executing remote command: Error executing command in container: container not found ("java-site")

But the pods weren't even running on the downed node:

$ kubectl describe pod java-mysql-5v7iu
Image(s):           mysql:5
Node:               10.70.2.2/10.70.2.2
Status:             Running

$ kubectl describe pod java-site-vboaq
Image(s):           javasite-img
Node:               10.70.2.2/10.70.2.2
Status:             Running

After the downed node (10.70.2.3) came back, everything returned to normal.

How do I fix this problem? If a node goes down, I want Kubernetes to migrate its pods accordingly and keep the services working. Does it have to do with the fact that the downed node was stuck during boot (waiting for a partition to mount) rather than being 100% down?
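A few additional checks that could help narrow this down (a sketch only; the object names are the ones from the outputs above):

$ kubectl get endpoints java-site      # does the Service still list a backend address?
$ kubectl describe node 10.70.2.3      # conditions and events reported for the NotReady node
$ kubectl logs java-site-vboaq         # logs of the container that kubectl reports as Running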

A few potential problems here:

1) Ceph needs its nodes to be up and running to be accessible: did you mean that the nodes were mounting disks from a different Ceph cluster, or is the Ceph cluster running on these same nodes? If it is the same nodes, then it makes sense that the drives becoming inaccessible paralyzes Kubernetes.

2) There is a bug (it was present in 1.0.6 at least; I'm not sure whether it has been resolved) where a pod cannot start when it tries to mount a disk that is still mounted on another node, because the volume was never unmounted there. There is a Kubernetes issue tracking this (sorry, I can't find the link right now).

3) etcd may also get stuck waiting for node 3 if only 2 members remain and that is not a majority of the configured cluster, since etcd needs a majority to elect a leader (see the health-check sketch below).
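A minimal sketch of how to check points 1) and 3), assuming the standard ceph and etcdctl (v2) command-line tools are available on the master with their default configuration:

$ ceph health            # or "ceph -s" for more detail: is the Ceph cluster healthy and does it have quorum?
$ etcdctl cluster-health # reports whether the etcd cluster is healthy and which members are unreachable
$ etcdctl member list    # shows how many members the cluster expects, so you can tell whether a majority is still up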

When a node goes down, Kubernetes does not immediately treat the pods on that node as dead. It waits about 5 minutes (the default pod eviction timeout) before declaring them dead.

So if your node rebooted but did not come back up to the point where the kubelet was running, any pod on that node would still appear present and alive in kubectl for about 5 minutes after the reboot started, even though it was in fact dead.
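If waiting 5 minutes is too long for your use case, these timings are controlled by flags on the kube-controller-manager. A minimal sketch, assuming you can edit its start-up arguments on the master (the values shown are illustrative, not recommendations):

--node-monitor-grace-period=40s    # how long a node may miss status updates before being marked NotReady (default 40s)
--pod-eviction-timeout=1m0s        # how long after NotReady before its pods are evicted and rescheduled elsewhere (default 5m0s)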
