
kube-apiserver docker is restarting continuously

Sincere apologies for this lengthy posting.

I have a 4 node Kubernetes cluster with 1 x master and 3 x worker nodes. I connect to the Kubernetes cluster using kubeconfig, but since yesterday I have not been able to connect.

kubectl get pods was giving the error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"

In the kubeconfig the server name is specified as https://api.xxxxx.xxxxxxxx.com

Note:

As there were too many https links, I was not able to post the question, so I have renamed https:// to https:-- to avoid the links in the background analysis section.

I tried to run kubectl from the master node and received a similar error: "The connection to the server localhost:8080 was refused - did you specify the right host or port?"

Then I checked the kube-apiserver docker container and it was continuously exiting / in CrashLoopBackOff.

docker logs <container-id of kube-apiserver> shows the errors below:

W0914 16:29:25.761524 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
F0914 16:29:29.319785 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (context deadline exceeded)

systemctl status kubelet --> was giving the errors below:

Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576 2411 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused

Note: ip-xxx-xx-xx-xxx --> internal IP address of the AWS EC2 instance.

Background Analysis:

It looks like there was some issue with the cluster on 7th Sep 2020, and both the kube-controller and kube-scheduler dockers exited and restarted. I believe kube-apiserver has not been running since then, or those dockers restarted because of kube-apiserver. The kube-apiserver server certificate expired in July 2020, but access via kubectl was working until 7th Sep.

Below are the docker logs from the exited kube-scheduler docker container:

I0907 10:35:08.970384 1 scheduler.go:572] pod default/k8version-1599474900-hrjcn is bound successfully on node ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:40:09.286831 1 scheduler.go:572] pod default/k8version-1599475200-tshlx is bound successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:44:01.935373 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.935420 1 server.go:252] lost master lost lease

Below are the docker logs from the exited kube-controller docker container:

I0907 10:40:19.703485 1 garbagecollector.go:518] delete object [v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid: 67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy Background
I0907 10:44:01.937398 1 leaderelection.go:263] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.937506 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0907 10:44:01.937456 1 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1", ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720 stopped leading
F0907 10:44:01.937545 1 controllermanager.go:260] leaderelection lost
I0907 10:44:01.949274 1 range_allocator.go:169] Shutting down range CIDR allocator
I0907 10:44:01.949285 1 replica_set.go:194] Shutting down replicaset controller
I0907 10:44:01.949291 1 gc_controller.go:86] Shutting down GC controller
I0907 10:44:01.949304 1 pvc_protection_controller.go:111] Shutting down PVC protection controller
I0907 10:44:01.949310 1 route_controller.go:125] Shutting down route controller
I0907 10:44:01.949316 1 service_controller.go:197] Shutting down service controller
I0907 10:44:01.949327 1 deployment_controller.go:164] Shutting down deployment controller
I0907 10:44:01.949435 1 garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295] Shutting down resource quota controller

Below are the docker logs from kube-controller since the restart (7th Sep):

E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:51:40.133446 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused

Below are the docker logs from kube-scheduler since the restart (7th Sep):

E0915 21:52:44.703587 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.704504 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: Get https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.705471 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.706477 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.707581 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.708599 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.709687 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.710744 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.711879 1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list *v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.712903 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused

kube-apiserver certificate renewal:

I found that the kube-apiserver certificate, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt, had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and events (the same copy of the certificates is in both places), but I don't see these referenced in the manifest files.
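For reference, the expiry date of an individual certificate can be checked with openssl; this is a generic check rather than output copied from my terminal:

    # prints the subject and the notAfter (expiry) date of the certificate
    openssl x509 -noout -subject -enddate \
        -in /etc/kubernetes/pki/kube-apiserver/etcd-client.crt
    # notAfter=...Jul ... 2020 GMT  -> already expired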

I searched and found steps to renew the certificates, but most of them used "kubeadm init phase" commands; I couldn't find kubeadm on the master server, and the certificate names and paths were different from my setup. So I generated a new certificate for kube-apiserver with openssl, using the existing CA cert, and included DNS names plus the internal and external IP addresses (EC2 instance) and the loopback IP address via an openssl.cnf file. I replaced the certificate under the same name /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
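A minimal sketch of the kind of openssl flow described above; the openssl.cnf contents, the CA key location (etcd-ca.key), the subject CN and the validity period are assumptions for illustration, not the exact files and values from my setup:

    # openssl.cnf (illustrative SAN section; real hostnames/IPs redacted)
    # [ req ]
    # req_extensions = v3_req
    # distinguished_name = req_distinguished_name
    # [ req_distinguished_name ]
    # [ v3_req ]
    # keyUsage = digitalSignature, keyEncipherment
    # extendedKeyUsage = clientAuth, serverAuth
    # subjectAltName = @alt_names
    # [ alt_names ]
    # DNS.1 = api.xxxxx.xxxxxxxx.com
    # IP.1  = 127.0.0.1
    # IP.2  = <internal-ip>
    # IP.3  = <external-ip>

    # generate a new key and CSR, then sign the CSR with the existing CA
    openssl genrsa -out etcd-client.key 2048
    openssl req -new -key etcd-client.key -subj "/CN=kube-apiserver-etcd-client" \
        -out etcd-client.csr -config openssl.cnf
    openssl x509 -req -in etcd-client.csr \
        -CA etcd-ca.crt -CAkey etcd-ca.key -CAcreateserial \
        -out etcd-client.crt -days 365 \
        -extensions v3_req -extfile openssl.cnf

Note that the certificate and the key are a pair: replacing only the .crt while keeping an old .key will make the TLS handshake fail.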

After that I restarted the kube-apiserver docker (which was continuously exiting) and restarted kubelet. The certificate expiry message no longer appears, but kube-apiserver is still continuously restarting, which I believe is the reason for the errors in the kube-controller and kube-scheduler docker containers.

NOTE:

I have not restarted docker on the master server after replacing the certificate.

NOTE: All our production pods are running on the worker nodes, so they are not affected, but I can't manage them because I can't connect using kubectl.

Now, I am not sure what the issue is and why kube-apiserver keeps restarting.

Update to the original question:

Kubernetes version: v1.14.1
Docker version: 18.6.3

Below are the latest docker logs from the kube-apiserver container (which is still crashing):

F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
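For reference, whether a certificate and a private key actually belong together can be checked by comparing their public keys; the paths below are the ones from the error above, and this is a generic check rather than output from my setup:

    # if the two hashes differ, the .crt and .key are not a matching pair
    openssl x509 -noout -pubkey -in /etc/kubernetes/pki/kube-apiserver/etcd-client.crt | sha256sum
    openssl rsa  -pubout -in /etc/kubernetes/pki/kube-apiserver/etcd-client.key | sha256sum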

Below is the output from systemctl status kubelet:

Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused

This cluster (along with 3 others) was set up using kops. The other clusters are running normally, although it looks like they have some expired certificates as well. The person who set up the clusters is not available for comment and I have limited experience with Kubernetes, hence I need assistance from the gurus.

Any help is very much appreciated.

Many thanks.

Update after responses from Zambozo and Nepomucen:

Thanks to both of you for your responses. Based on that, I found that there were expired etcd certificates on the /mnt mount point.

I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/

and recreated the etcd certificates and keys. I have verified each certificate against a copy of the old one (from my backup folder), everything matches, and the new certificates have an expiry date of Sep 2021.
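For anyone checking the same thing, a small sketch of how the expiry dates of the certificates on the etcd volumes can be listed; the assumption here is only that the etcd-manager volumes are mounted somewhere under /mnt, as they were in my case:

    # print the notAfter date of every certificate found under /mnt
    find /mnt -name '*.crt' 2>/dev/null | while read -r crt; do
        printf '%s: ' "$crt"
        openssl x509 -noout -enddate -in "$crt"
    done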

Now I am getting a different error on the etcd dockers (both etcd-manager-events and etcd-manager-main).

Note: xxx-xx-xx-xxx is the IP address of the master server.

root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts

root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a

Could you please suggest how to proceed from here?

Many thanks.

I think this is related to etcd. You may have renewed the certs for the Kubernetes components, but did you do the same for etcd? Your API server is trying to connect to etcd and giving:

tls: private key does not match public key)

As you have only 1 etcd (judging by the number of master nodes), I would take a backup of it before trying to fix it.
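A rough sketch of what such a file-level backup could look like, assuming the etcd data and pki live under the /mnt volumes mentioned above and that the etcd-manager containers are stopped first; the path and layout are assumptions, not a verified procedure for this cluster:

    # stop the etcd-manager containers, then copy the volumes aside
    backup_dir=/root/etcd-backup-$(date +%F)
    mkdir -p "$backup_dir"
    cp -a /mnt "$backup_dir/"   # may be large, as it includes the etcd data directories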

Generating a new cert with openssl for kube-apiserver and replacing both the cert and the key brought the kube-apiserver docker to a stable state and restored access via kubectl.

To resolve the etcd-manager certs issue, I upgraded etcd-manager to kopeio/etcd-manager:3.0.20200531 for both etcd-manager-main and etcd-manager-events, as described at https://github.com/kubernetes/kops/issues/8959#issuecomment-673515269
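For illustration only, this is roughly what the image override could look like in the etcd-manager static pod manifests on the master; the manifest file names and the original image tag are assumptions about a typical kops layout, and on a kops-managed cluster the change may also need to be reflected in the cluster spec so it survives a rolling update:

    # point both etcd-manager manifests at the newer image; kubelet restarts the pods
    sed -i 's|image: kopeio/etcd-manager:.*|image: kopeio/etcd-manager:3.0.20200531|' \
        /etc/kubernetes/manifests/etcd.manifest \
        /etc/kubernetes/manifests/etcd-events.manifest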

Thank you.
