

Kops rolling-update fails with “Cluster did not pass validation” for master node

For some reason my master node can no longer connect to my cluster after upgrading from Kubernetes 1.11.9 to 1.12.9 via kops (version 1.13.0). In the manifest I'm upgrading kubernetesVersion from 1.11.9 to 1.12.9. This is the only change I'm making. However, when I run kops rolling-update cluster --yes I get the following error:

Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-01234567" has not yet joined cluster.
Cluster did not validate within 5m0s

After that, if I run kubectl get nodes I no longer see that master node in my cluster.
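For reference, a minimal sketch of the upgrade flow in question, assuming the standard kops sequence (and that KOPS_STATE_STORE and the cluster name are configured in the environment):

kops edit cluster                  # change kubernetesVersion: 1.11.9 -> 1.12.9 in the manifest
kops update cluster --yes          # push the new spec to the state store
kops rolling-update cluster --yes  # replace nodes one at a time; this is the step that fails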

Doing a little bit of debugging by SSHing into the disconnected master node instance, I found the following error in my api-server log by running sudo cat /var/log/kube-apiserver.log:

controller.go:135] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused

I suspect the issue might be related to etcd, because when I run sudo netstat -nap | grep LISTEN | grep etcd there is no output.
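A couple of follow-up checks that could confirm etcd is down (a sketch; port 4001 is taken from the api-server error above, while 4002 for etcd-events is a kops default and an assumption here):

sudo netstat -nap | grep LISTEN | grep -E ':4001|:4002'   # etcd / etcd-events client ports
sudo docker ps -a | grep etcd                             # are the etcd containers running or Exited?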

Does anyone have any idea how I can get my master node back into the cluster, or have advice on things to try?

I have done some research and have a few ideas for you:

  1. If there is no output from the etcd grep, it means that your etcd server is down. Check for an 'Exited' etcd container (docker ps -a | grep Exited | grep etcd) and then read its logs (docker logs <etcd-container-id>); see the sketch after this list.

  2. Try this instruction I found (a consolidated sketch of these steps follows after this list):

1 - I removed the old master from the etcd cluster using etcdctl. You will need to connect to the etcd-server container to do this.

2 - On the new master node, I stopped the kubelet and protokube services.

3 - Empty the etcd data dirs (data and data-events).

4 - Edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest, changing ETCD_INITIAL_CLUSTER_STATE from new to existing.

5 - Get the name and PeerURLs from the new master and use etcdctl to add the new master to the cluster (etcdctl member add "name" "PeerURL"). You will need to connect to the etcd-server container to do this.

6 - Start the kubelet and protokube services on the new master.

  3. If that is not the case, then you might have a problem with the certs. They are provisioned during the creation of the cluster, and some of them include the allowed master endpoints. If that is the case, you'd need to create new certs and roll them out for the api server/etcd clusters.
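Regarding idea 1, a minimal sketch of the container check, assuming etcd runs under Docker on the master as it does on a kops-provisioned node:

sudo docker ps -a | grep Exited | grep etcd   # find the exited etcd container and note its ID
sudo docker logs <etcd-container-id>          # read its logs for the reason it died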
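And regarding idea 2, a consolidated sketch of steps 1-6. The member IDs, names, URLs, and data-dir paths are illustrative placeholders (and the peer port 2380 is an assumption), not values taken from your cluster:

# 1 - inside the etcd-server container: remove the old master from the member list
etcdctl member list                   # note the ID of the dead member
etcdctl member remove <old-member-id>

# 2 - on the new master: stop the services that manage etcd
sudo systemctl stop kubelet protokube

# 3 - empty the etcd data dirs (locations vary; check your etcd manifests for the host paths)
sudo rm -rf <etcd-data-dir>/* <etcd-events-data-dir>/*

# 4 - edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest by hand,
#     changing ETCD_INITIAL_CLUSTER_STATE from "new" to "existing"

# 5 - inside the etcd-server container: add the new master back
etcdctl member add <new-member-name> http://<new-master-ip>:2380

# 6 - start the services again on the new master
sudo systemctl start kubelet protokube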

Please let me know if that helped.
