

Kops rolling-update fails with “Cluster did not pass validation” for master node

For some reason my master node can no longer connect to my cluster after upgrading from Kubernetes 1.11.9 to 1.12.9 via kops (version 1.13.0). In the manifest I'm upgrading kubernetesVersion from 1.11.9 to 1.12.9. This is the only change I'm making. However, when I run kops rolling-update cluster --yes I get the following error:

Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-01234567" has not yet joined cluster.
Cluster did not validate within 5m0s

After that, if I run kubectl get nodes I no longer see that master node in my cluster.
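For reference, a minimal sketch of the upgrade flow in question, assuming the standard kops sequence (and that KOPS_STATE_STORE and the cluster name are configured in the environment):

kops edit cluster                  # change kubernetesVersion: 1.11.9 -> 1.12.9 in the manifest
kops update cluster --yes          # push the new spec to the state store
kops rolling-update cluster --yes  # replace nodes one at a time; this is the step that fails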

Doing a little bit of debugging by SSHing into the disconnected master node instance, I found the following error in my api-server log by running sudo cat /var/log/kube-apiserver.log:

controller.go:135] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused

I suspect the issue might be related to etcd, because when I run sudo netstat -nap | grep LISTEN | grep etcd there is no output.
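A couple of follow-up checks that could confirm etcd is down (a sketch; port 4001 is taken from the api-server error above, while 4002 for etcd-events is a kops default and an assumption here):

sudo netstat -nap | grep LISTEN | grep -E ':4001|:4002'   # etcd / etcd-events client ports
sudo docker ps -a | grep etcd                             # are the etcd containers running or Exited?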

Does anyone have any idea how I can get my master node back into the cluster, or have advice on things to try?

I have done some research and have a few ideas for you:

  1. If there is no output from the etcd grep, it means that your etcd server is down. Check for an 'Exited' etcd container (docker ps -a | grep Exited | grep etcd) and then read its logs (docker logs <etcd-container-id>); see the sketch after this list.

  2. Try this instruction I found (a consolidated sketch of these steps follows after this list):

1 - I removed the old master from the etcd cluster using etcdctl. You will need to connect to the etcd-server container to do this.

2 - On the new master node, I stopped the kubelet and protokube services.

3 - Empty the etcd data dirs (data and data-events).

4 - Edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest, changing ETCD_INITIAL_CLUSTER_STATE from new to existing.

5 - Get the name and PeerURLs from the new master and use etcdctl to add the new master to the cluster (etcdctl member add "name" "PeerURL"). You will need to connect to the etcd-server container to do this.

6 - Start the kubelet and protokube services on the new master.

  3. If that is not the case, then you might have a problem with the certs. They are provisioned during the creation of the cluster, and some of them include the allowed master endpoints. If that is the case, you'd need to create new certs and roll them out for the api server/etcd clusters.
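Regarding idea 1, a minimal sketch of the container check, assuming etcd runs under Docker on the master as it does on a kops-provisioned node:

sudo docker ps -a | grep Exited | grep etcd   # find the exited etcd container and note its ID
sudo docker logs <etcd-container-id>          # read its logs for the reason it died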
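And regarding idea 2, a consolidated sketch of steps 1-6. The member IDs, names, URLs, and data-dir paths are illustrative placeholders (and the peer port 2380 is an assumption), not values taken from your cluster:

# 1 - inside the etcd-server container: remove the old master from the member list
etcdctl member list                   # note the ID of the dead member
etcdctl member remove <old-member-id>

# 2 - on the new master: stop the services that manage etcd
sudo systemctl stop kubelet protokube

# 3 - empty the etcd data dirs (locations vary; check your etcd manifests for the host paths)
sudo rm -rf <etcd-data-dir>/* <etcd-events-data-dir>/*

# 4 - edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest by hand,
#     changing ETCD_INITIAL_CLUSTER_STATE from "new" to "existing"

# 5 - inside the etcd-server container: add the new master back
etcdctl member add <new-member-name> http://<new-master-ip>:2380

# 6 - start the services again on the new master
sudo systemctl start kubelet protokube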

Please let me know if that helped.
