etcd database cluster on Kubernetes is misbehaving

Question

In my project we have an etcd database deployed on an on-prem Kubernetes cluster (this etcd is for application use, separate from the Kubernetes etcd). I deployed it as a StatefulSet using the Bitnami Helm chart. Initially the number of replicas was 1, because at the time we wanted a single etcd instance.

The real problem started when we scaled it up to 3. I updated the configuration by setting ETCD_INITIAL_CLUSTER to include the DNS names of the two new members:

etcd-0=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380
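The value above follows the StatefulSet pattern <pod>=http://<pod>.<headless-service>.<namespace>.svc.cluster.local:2380. As a minimal sketch, the string can be generated for any replica count; the function name initial_cluster is made up for illustration and the defaults simply mirror the names in the question. Keep in mind that etcd reads the initial-cluster settings only when a member bootstraps with an empty data directory, so changing this value alone may not affect a member that is already running.

```shell
#!/bin/sh
# Build an ETCD_INITIAL_CLUSTER value for a StatefulSet with N replicas.
# Hypothetical helper: the defaults (etcd, etcd-headless, wallet) are the
# names used in the question; pass your own as arguments 2 to 4.
initial_cluster() (
  replicas=$1; name=${2:-etcd}; svc=${3:-etcd-headless}; ns=${4:-wallet}
  out=""; i=0
  while [ "$i" -lt "$replicas" ]; do
    # Prefix a comma for every entry except the first.
    out="${out}${out:+,}${name}-${i}=http://${name}-${i}.${svc}.${ns}.svc.cluster.local:2380"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
)

initial_cluster 3
```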

Now when I go inside any of the etcd pods and run etcdctl member list, I only get a list of members and none of them is marked as leader, which is wrong. One of the three should be the leader.

Also, after running for some time, these pods start logging heartbeat-timeout and server-overload warnings:

W | etcdserver: failed to send out heartbeat on time (exceeded the 950ms timeout for 593.648512ms, to a9b7b8c4e027337a)
W | etcdserver: server is likely overloaded
W | wal: sync duration of 2.575790761s, expected less than 1s

I changed the heartbeat interval from its default value accordingly. The number of warnings decreased, but I still get a few heartbeat warnings along with the others.

I am not sure what the problem is. Is it disk I/O that is causing it? If so, I do not know how to confirm that.

I would really appreciate any help with this.
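One rough way to check the I/O suspicion is to count how often each warning appears. The "wal: sync duration" line is logged when flushing the write-ahead log to disk is slow, so frequent occurrences point at storage rather than at the network. The sketch below is only a triage aid; the function name warning_summary is made up for illustration, and the sample input is the three lines quoted in the question. For a firmer answer, etcd also exposes the etcd_disk_wal_fsync_duration_seconds metric, and the etcd documentation suggests its 99th percentile should stay below roughly 10ms.

```shell
#!/bin/sh
# Count the three warning types from the question in etcd log text on stdin.
# Hypothetical helper for triage; it matches on the message text only.
warning_summary() {
  awk '
    /wal: sync duration/           { wal++ }
    /failed to send out heartbeat/ { hb++ }
    /server is likely overloaded/  { ov++ }
    END { printf "wal_sync=%d heartbeat=%d overloaded=%d\n", wal, hb, ov }
  '
}

# Demo on the log lines quoted in the question.
warning_summary <<'EOF'
W | etcdserver: failed to send out heartbeat on time (exceeded the 950ms timeout for 593.648512ms, to a9b7b8c4e027337a
W | etcdserver: server is likely overloaded
W | wal: sync duration of 2.575790761s, expected less than 1s
EOF
# prints: wal_sync=1 heartbeat=1 overloaded=1
```

Against a live pod the same function can be fed from kubectl, for example: kubectl logs etcd-0 -n wallet | warning_summary (pod name and namespace taken from the question).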

Answer 1

I don't think the heartbeats are the main problem; the log lines you are seeing are warnings. So it's possible that some heartbeats are missed here and there, but your node(s) are not crashing or mirroring.

It's likely that you changed the replica count and your new replicas are not joining the cluster. I would recommend following this guide to add the new members to the cluster. Basically, with etcdctl, something like this:

# --peer-urls takes the peer URL of the member being added
etcdctl member add node2 --peer-urls=http://node2:2380
etcdctl member add node3 --peer-urls=http://node3:2380

Note that you will have to run these commands in a pod that has access to all the etcd nodes in your cluster.

You could also consider managing your etcd cluster with the etcd operator, which should be able to take care of scaling and of adding and removing nodes.
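Applied to the pod names from the question, the registration step might look like the sketch below. It only prints the commands so they can be reviewed first; member_add_cmd is a hypothetical helper, and the service and namespace defaults are assumptions copied from the question. Each new member is normally added one at a time and then started with ETCD_INITIAL_CLUSTER_STATE=existing so that it joins the running cluster instead of bootstrapping a new one.

```shell
#!/bin/sh
# Print (do not run) the `etcdctl member add` command for one new pod.
# Hypothetical helper: the defaults mirror the names in the question.
member_add_cmd() (
  idx=$1; name=${2:-etcd}; svc=${3:-etcd-headless}; ns=${4:-wallet}
  # The peer URL is the new member's own address on port 2380.
  peer="http://${name}-${idx}.${svc}.${ns}.svc.cluster.local:2380"
  printf 'etcdctl member add %s --peer-urls=%s\n' "${name}-${idx}" "$peer"
)

# The two members added when scaling from 1 to 3 replicas.
member_add_cmd 1
member_add_cmd 2
```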

✌️ ✌️

Answer 2

Okay, I had two problems:

1. "failed to send out heartbeat" warning messages.
2. "No leader election".

The next day I found the reason for the second problem: I had a startup parameter set in the pod definition, ETCDCTL_API: 3.

So when I run "etcdctl member list" with API v3, it doesn't show which member is elected as leader.

$ ETCDCTL_API=3 etcdctl member list
    
    3d0bc1a46f81ecd9, started, etcd-2, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2379, false
    b6a5d762d566708b, started, etcd-1, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2379, false


$ ETCDCTL_API=2 etcdctl member list
    
    3d0bc1a46f81ecd9, started, etcd-2, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2379, false
    b6a5d762d566708b, started, etcd-1, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2379, true

So when I use API v2 I can see which node is elected as leader, and there was no problem with leader election. I am still working on the heartbeat warnings, but I guess I need to tune the config in order to avoid them.

NB: I have 3 nodes; I stopped one for testing.
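With API v3 the leader can still be found through etcdctl endpoint status, which reports a leader flag per endpoint (add -w table to see the column headers). The sketch below filters that command's default comma-separated output. leader_endpoints is a hypothetical helper, and it assumes the leader flag is the fifth field, which is the layout I would expect from etcd v3.3 and v3.4 but which is worth checking against the table output of your version.

```shell
#!/bin/sh
# Read the default comma-separated output of `etcdctl endpoint status`
# from stdin and print the endpoints whose leader flag is "true".
# Assumption: field 5 is IS LEADER; confirm with `-w table` on your version.
leader_endpoints() {
  awk -F', ' '$5 == "true" { print $1 }'
}

# Usage, from inside one of the etcd pods:
#   ETCDCTL_API=3 etcdctl endpoint status --cluster | leader_endpoints
```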

Answer 1
1 2020-07-21 20:29:07

Answer 2
0 2020-07-30 12:20:41
