
arangodb cluster restart failure

We set up an arangodb cluster with 3 agents, 5 coordinators and 5 DB servers on 5 machines.

Env: CentOS 6

In our experience, if one of the servers exceeded its maximum memory, the whole cluster would fail. To avoid this, and since we did not find a way to limit memory use, we regularly observe every node with the command top | grep arangod and restart any instance that consumes too much. This usually works fine. But when we tried to restart one node, we received the following logs:

    2018/03/27 15:47:31 Failed to get master URL, retrying in 5sec (All 3 servers responded with temporary failure)
    2018/03/27 15:47:31 ## Start of dbserver log
        2018-03-27T07:46:31Z [37755] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-03-27T07:46:31Z [37755] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-03-27T07:46:31Z [37755] INFO using storage engine rocksdb
        2018-03-27T07:46:31Z [37755] INFO {cluster} Starting up with role PRIMARY
        2018-03-27T07:46:41Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 21 (9.84s). Network checks advised.
        2018-03-27T07:46:42Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 22 (10.82s). Network checks advised.
        2018-03-27T07:46:43Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 23 (11.89s). Network checks advised.
        2018-03-27T07:46:44Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 24 (13.03s). Network checks advised.
        2018-03-27T07:46:46Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 25 (14.25s). Network checks advised.
        2018-03-27T07:46:47Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 26 (15.57s). Network checks advised.
        2018-03-27T07:46:48Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 27 (16.99s). Network checks advised.
        2018-03-27T07:46:50Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 28 (18.51s). Network checks advised.
        2018-03-27T07:46:51Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 29 (20.15s). Network checks advised.
        2018-03-27T07:46:53Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 30 (21.9s). Network checks advised.
        2018-03-27T07:46:55Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 31 (23.8s). Network checks advised.
        2018-03-27T07:46:57Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 32 (25.83s). Network checks advised.
        2018-03-27T07:46:59Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 33 (28.01s). Network checks advised.
        2018-03-27T07:47:02Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 34 (30.36s). Network checks advised.
        2018-03-27T07:47:04Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 35 (32.89s). Network checks advised.
        2018-03-27T07:47:04Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 36 (32.89s). Network checks advised.
2018/03/27 15:47:31 ## End of dbserver log
2018/03/27 15:47:32 ## Start of coordinator log
        2018-03-27T07:46:32Z [37769] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-03-27T07:46:32Z [37769] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-03-27T07:46:32Z [37769] INFO using storage engine rocksdb
        2018-03-27T07:46:32Z [37769] INFO {cluster} Starting up with role COORDINATOR
        2018-03-27T07:46:42Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 21 (9.84s). Network checks advised.
        2018-03-27T07:46:43Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 22 (10.82s). Network checks advised.
        2018-03-27T07:46:44Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 23 (11.89s). Network checks advised.
        2018-03-27T07:46:45Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 24 (13.03s). Network checks advised.
        2018-03-27T07:46:47Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 25 (14.25s). Network checks advised.
        2018-03-27T07:46:48Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 26 (15.57s). Network checks advised.
        2018-03-27T07:46:49Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 27 (16.99s). Network checks advised.
        2018-03-27T07:46:51Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 28 (18.51s). Network checks advised.
        2018-03-27T07:46:52Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 29 (20.14s). Network checks advised.
        2018-03-27T07:46:54Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 30 (21.9s). Network checks advised.
        2018-03-27T07:46:56Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 31 (23.8s). Network checks advised.
        2018-03-27T07:46:58Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 32 (25.83s). Network checks advised.
        2018-03-27T07:47:00Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 33 (28.01s). Network checks advised.
        2018-03-27T07:47:03Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 34 (30.36s). Network checks advised.
        2018-03-27T07:47:05Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 35 (32.89s). Network checks advised.
        2018-03-27T07:47:05Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 36 (32.89s). Network checks advised.
2018/03/27 15:47:32 ## End of coordinator log
2018/03/27 15:47:46 Failed to get master URL, retrying in 5sec (All 3 servers responded with temporary failure)
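The manual watchdog described above (watching top | grep arangod and restarting heavy instances) could be sketched roughly as follows; note that the 8 GiB threshold and the restart hook are assumptions for illustration, not part of the original setup:

```shell
#!/bin/sh
# Rough sketch of the manual memory watchdog described above.
# THRESHOLD_KB (8 GiB here) and the restart action are assumptions;
# adapt both to your own limits and service manager.
THRESHOLD_KB=$((8 * 1024 * 1024))   # ps reports RSS in KiB

ps -C arangod -o pid=,rss= | while read -r pid rss; do
  if [ "$rss" -gt "$THRESHOLD_KB" ]; then
    echo "arangod pid $pid: RSS ${rss} KiB exceeds threshold, restart advised"
    # e.g. restart the starter-managed instance for this pid here
  fi
done
```

Running something like this from cron would automate the periodic check, though as the logs above show, restarting an instance is not always risk-free.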

All the servers can ping each other, so it is not a network problem.

Just as I was writing this question and collecting log info, the cluster restarted successfully, which is kind of weird. And now 2 of the nodes print the log:

updated cluster config does not contain myself. rejecting

It now takes a really long time to show collections, and the cluster is not working normally. Does anybody know why?

[quoting the GitHub discussion]

Please note that the option --cluster.agency-size 5 has to be used only when starting the Cluster for the first time. This is because on the first startup the starter writes the configuration of the Cluster, which cannot be changed anymore.

So in your case, if you need to add more agents on additional nodes, you have to use --cluster.start-agent true on each new node. If you want to be sure that your 5-node cluster stays up and running when bringing down two (random) nodes, then an agency size of 5 is what you need.
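Following that advice, bringing up a new node that also runs an agent might look like the invocation below. This is only a sketch: the join address and data directory are placeholders, and the exact starter flags may differ between versions, so check your starter's --help output.

```shell
# Sketch: start a new cluster node that also runs an agent.
# node1:8528 and the data directory are placeholders for your environment.
arangodb --starter.data-dir=/var/lib/arangodb-node6 \
         --starter.join=node1:8528 \
         --cluster.start-agent=true
```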

The Cluster cannot work if the Agency is not up and running. The Agency uses the RAFT protocol. If your Agency is made of 3 Agents and two of them are down, then the Agency is down (and so is your Cluster). If your Agency is made of 5 Agents, then it will survive two agents being down (and so will your Cluster).
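The RAFT arithmetic behind this can be checked with a quick sketch: an agency of size N needs floor(N/2) + 1 agents for a quorum, so it tolerates N minus that many failures.

```shell
# RAFT quorum arithmetic: majority = floor(N/2) + 1,
# tolerated agent failures = N - majority.
for n in 3 5 7; do
  majority=$(( n / 2 + 1 ))
  echo "agency size $n: quorum $majority, survives $(( n - majority )) agents down"
done
```

This prints that a 3-agent agency survives 1 failure and a 5-agent agency survives 2, matching the statement above.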

If you want to survive 3 machines going down, then other setups are possible.

You can also consider using separate machines for the Agency, e.g.:

  • 3 dedicated machines for the Agency
  • plus 3 additional machines for DBServers + Coordinators (6 machines in total), with a replication factor of 3

The above setup survives 1 Agent down and 2 DBServers down (so 3 machines down in total).
