
Kubernetes Calico node 'XXXXXXXXXXX' already using IPv4 Address XXXXXXXXX, CrashLoopBackOff

I used the AWS Kubernetes Quickstart to create a Kubernetes cluster in a VPC and private subnet: https://aws-quickstart.s3.amazonaws.com/quickstart-heptio/doc/heptio-kubernetes-on-the-aws-cloud.pdf . It was running fine for a while. I have Calico installed on my Kubernetes cluster, with two nodes and a master. The calico pods on the master are running fine; the ones on the nodes are in CrashLoopBackOff state:

NAME                                                               READY     STATUS             RESTARTS   AGE
calico-etcd-ztwjj                                                  1/1       Running            1          55d
calico-kube-controllers-685755779f-ftm92                           1/1       Running            2          55d
calico-node-gkjgl                                                  1/2       CrashLoopBackOff   270        22h
calico-node-jxkvx                                                  2/2       Running            4          55d
calico-node-mxhc5                                                  1/2       CrashLoopBackOff   9          25m

Describing one of the crashed pods:

ubuntu@ip-10-0-1-133:~$ kubectl describe pod calico-node-gkjgl -n kube-system
Name:           calico-node-gkjgl
Namespace:      kube-system
Node:           ip-10-0-0-237.us-east-2.compute.internal/10.0.0.237
Start Time:     Mon, 17 Sep 2018 16:56:41 +0000
Labels:         controller-revision-hash=185957727
                k8s-app=calico-node
                pod-template-generation=1
Annotations:    scheduler.alpha.kubernetes.io/critical-pod=
Status:         Running
IP:             10.0.0.237
Controlled By:  DaemonSet/calico-node
Containers:
  calico-node:
    Container ID:   docker://d89979ba963c33470139fd2093a5427b13c6d44f4c6bb546c9acdb1a63cd4f28
    Image:          quay.io/calico/node:v3.1.1
    Image ID:       docker-pullable://quay.io/calico/node@sha256:19fdccdd4a90c4eb0301b280b50389a56e737e2349828d06c7ab397311638d29
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 18 Sep 2018 15:14:44 +0000
      Finished:     Tue, 18 Sep 2018 15:14:44 +0000
    Ready:          False
    Restart Count:  270
    Requests:
      cpu:      250m
    Liveness:   http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ETCD_ENDPOINTS:                     <set to the key 'etcd_endpoints' of config map 'calico-config'>  Optional: false
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       kubeadm,bgp
      CALICO_DISABLE_FILE_LOGGING:        true
      CALICO_K8S_NODE_REF:                 (v1:spec.nodeName)
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPV6SUPPORT:                  false
      FELIX_IPINIPMTU:                    1440
      FELIX_LOGSEVERITYSCREEN:            info
      IP:                                 autodetect
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
  install-cni:
    Container ID:  docker://b37e0ec7eba690473a4999a31d9f766f7adfa65f800a7b2dc8e23ead7520252d
    Image:         quay.io/calico/cni:v3.1.1
    Image ID:      docker-pullable://quay.io/calico/cni@sha256:dc345458d136ad9b4d01864705895e26692d2356de5c96197abff0030bf033eb
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Running
      Started:      Mon, 17 Sep 2018 17:11:52 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 17 Sep 2018 16:56:43 +0000
      Finished:     Mon, 17 Sep 2018 17:10:53 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      CNI_CONF_NAME:       10-calico.conflist
      ETCD_ENDPOINTS:      <set to the key 'etcd_endpoints' of config map 'calico-config'>      Optional: false
      CNI_NETWORK_CONFIG:  <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  calico-cni-plugin-token-b7sfl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-cni-plugin-token-b7sfl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoSchedule
                 :NoExecute
                 :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason   Age                  From                                               Message
  ----     ------   ----                 ----                                               -------
  Warning  BackOff  4m (x6072 over 22h)  kubelet, ip-10-0-0-237.us-east-2.compute.internal  Back-off restarting failed container

The logs for the same pod:

ubuntu@ip-10-0-1-133:~$ kubectl logs calico-node-gkjgl -n kube-system -c calico-node
2018-09-18 15:14:44.605 [INFO][8] startup.go 251: Early log level set to info
2018-09-18 15:14:44.605 [INFO][8] startup.go 269: Using stored node name from /var/lib/calico/nodename
2018-09-18 15:14:44.605 [INFO][8] startup.go 279: Determined node name: ip-10-0-0-237.us-east-2.compute.internal
2018-09-18 15:14:44.609 [INFO][8] startup.go 101: Skipping datastore connection test
2018-09-18 15:14:44.610 [INFO][8] startup.go 352: Building new node resource Name="ip-10-0-0-237.us-east-2.compute.internal"
2018-09-18 15:14:44.610 [INFO][8] startup.go 367: Initialize BGP data
2018-09-18 15:14:44.614 [INFO][8] startup.go 564: Using autodetected IPv4 address on interface ens3: 10.0.0.237/19
2018-09-18 15:14:44.614 [INFO][8] startup.go 432: Node IPv4 changed, will check for conflicts
2018-09-18 15:14:44.618 [WARNING][8] startup.go 861: Calico node 'ip-10-0-0-237' is already using the IPv4 address 10.0.0.237.
2018-09-18 15:14:44.618 [WARNING][8] startup.go 1058: Terminating
Calico node failed to start

So it seems there is a conflict in finding the node IP address: Calico thinks the IP is already assigned to another node. A quick search turned up this thread: https://github.com/projectcalico/calico/issues/1628 . It suggests this should be resolved by setting IP_AUTODETECTION_METHOD to can-reach=DESTINATION, which I'm assuming would be "can-reach=10.0.0.237". This setting is an environment variable on the calico/node container. I have been attempting to shell into the container itself, but kubectl tells me the container is not found:

ubuntu@ip-10-0-1-133:~$ kubectl exec calico-node-gkjgl --stdin --tty /bin/sh -c calico-node -n kube-system
error: unable to upgrade connection: container not found ("calico-node")

I'm suspecting this is due to Calico being unable to assign IPs. So I logged onto the host and attempted to shell into the container using Docker:

root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"

So I guess there is no shell to execute in the container, which explains why Kubernetes couldn't exec into it either. I tried running commands externally to list the environment variables, but I haven't been able to find any; I could be running these commands wrong, however:

root@ip-10-0-0-237:~# docker inspect -f '{{range $index, $value := .Config.Env}}{{$value}} {{end}}' k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 printenv IP_AUTODETECTION_METHOD
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"printenv\": executable file not found in $PATH"

root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/env
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/env\": stat /bin/env: no such file or directory"
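The `k8s_POD_*` name targeted in the commands above is the pod's pause (infrastructure) container, which only holds the pod's namespaces and ships neither a shell nor the application's environment variables, so these exec attempts cannot succeed. The environment of the actual calico-node container can be read from the host without a shell; this is a sketch, and the `k8s_calico-node_` name prefix is an assumption based on Docker's Kubernetes container-naming convention:

```shell
# Find the application container rather than the k8s_POD_ pause container
docker ps -a --format '{{.Names}}' | grep '^k8s_calico-node_'
# Dump its environment via inspect, which needs no shell inside the image
docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' <name from the previous command>
```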

Okay, so maybe I am going about this the wrong way. Should I change the Calico configuration through Kubernetes and redeploy it? Where can I find these settings on my system? I haven't been able to find where the environment variables are set.

If you look at the Calico docs, IP_AUTODETECTION_METHOD already defaults to first-found.

My guess is that something, or the IP address, is not being released by the previous 'run' of Calico, or it is simply a bug in the v3.1.1 version of Calico.
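That guess can be checked against the datastore directly: the warning in the log names a node 'ip-10-0-0-237' (short hostname), while this pod registered under the full name ip-10-0-0-237.us-east-2.compute.internal, which is the classic symptom of a stale node entry still holding the IP. A hedged sketch with calicoctl, assuming it is configured against the cluster's etcd datastore:

```shell
# List node entries known to Calico; look for a duplicate short-hostname entry
calicoctl get nodes
# If a stale 'ip-10-0-0-237' entry exists alongside the full-name node,
# deleting it releases the IPv4 address it claims
calicoctl delete node ip-10-0-0-237
```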

Try:

  1. Delete your Calico pods that are in a CrashLoopBackOff loop:

     kubectl -n kube-system delete pod calico-node-gkjgl calico-node-mxhc5

     Your pods will be re-created and hopefully initialize.

  2. Upgrade Calico to v3.1.3 or the latest version, following these docs. My guess is that Heptio's Calico installation is using the etcd datastore.

  3. Try to understand how Heptio's AWS AMIs work and see if there are any issues with them. This might take some time, so you could contact their support as well.

  4. Try a different method to install Kubernetes with Calico; this is well documented on https://kubernetes.io .
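If autodetection itself needs to be steered (as the linked GitHub issue suggests), the environment variable belongs on the calico-node DaemonSet rather than inside a running container; a sketch, using the address from the question:

```shell
# Add the variable to the calico-node container of the DaemonSet;
# the DaemonSet then rolls its pods with the new setting
kubectl -n kube-system set env daemonset/calico-node -c calico-node \
    IP_AUTODETECTION_METHOD=can-reach=10.0.0.237
```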

For me, what worked was to remove the leftover Docker networks on the nodes.
I had to list the current networks on each node (docker network list) and then remove the unneeded ones (docker network rm <networkName>).
After doing that, the calico deployment pods were running fine.
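A sketch of that cleanup on one node; the network name is a placeholder, and Docker's built-in bridge, host and none networks should be left alone:

```shell
# List Docker networks on the node and remove leftovers from previous runs
docker network ls
docker network rm <networkName>
# Then delete the still-crashing pod so the DaemonSet recreates it cleanly
kubectl -n kube-system delete pod calico-node-mxhc5
```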
