
ArangoDB cluster restart failure

We set up an ArangoDB cluster with 3 agents, 5 coordinators, and 5 DB servers across 5 machines.

Environment: CentOS 6

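In our experience, if any one of the servers runs out of memory, the whole cluster fails. To avoid this, and because we have not found a way to limit memory usage, we periodically check each node with the command top | grep arangod and restart any node that is consuming too much. This usually works fine. But when we tried to restart one node, we got logs like the following: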

    2018/03/27 15:47:31 Failed to get master URL, retrying in 5sec (All 3 servers responded with temporary failure)
    2018/03/27 15:47:31 ## Start of dbserver log
        2018-03-27T07:46:31Z [37755] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-03-27T07:46:31Z [37755] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-03-27T07:46:31Z [37755] INFO using storage engine rocksdb
        2018-03-27T07:46:31Z [37755] INFO {cluster} Starting up with role PRIMARY
        2018-03-27T07:46:41Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 21 (9.84s). Network checks advised.
        2018-03-27T07:46:42Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 22 (10.82s). Network checks advised.
        2018-03-27T07:46:43Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 23 (11.89s). Network checks advised.
        2018-03-27T07:46:44Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 24 (13.03s). Network checks advised.
        2018-03-27T07:46:46Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 25 (14.25s). Network checks advised.
        2018-03-27T07:46:47Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 26 (15.57s). Network checks advised.
        2018-03-27T07:46:48Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 27 (16.99s). Network checks advised.
        2018-03-27T07:46:50Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 28 (18.51s). Network checks advised.
        2018-03-27T07:46:51Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 29 (20.15s). Network checks advised.
        2018-03-27T07:46:53Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 30 (21.9s). Network checks advised.
        2018-03-27T07:46:55Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 31 (23.8s). Network checks advised.
        2018-03-27T07:46:57Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 32 (25.83s). Network checks advised.
        2018-03-27T07:46:59Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 33 (28.01s). Network checks advised.
        2018-03-27T07:47:02Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 34 (30.36s). Network checks advised.
        2018-03-27T07:47:04Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 35 (32.89s). Network checks advised.
        2018-03-27T07:47:04Z [37755] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 36 (32.89s). Network checks advised.
    2018/03/27 15:47:31 ## End of dbserver log
    2018/03/27 15:47:32 ## Start of coordinator log
        2018-03-27T07:46:32Z [37769] WARNING {memory} It is recommended to set NUMA to interleaved.
        2018-03-27T07:46:32Z [37769] WARNING {memory} put 'numactl --interleave=all' in front of your command
        2018-03-27T07:46:32Z [37769] INFO using storage engine rocksdb
        2018-03-27T07:46:32Z [37769] INFO {cluster} Starting up with role COORDINATOR
        2018-03-27T07:46:42Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 21 (9.84s). Network checks advised.
        2018-03-27T07:46:43Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 22 (10.82s). Network checks advised.
        2018-03-27T07:46:44Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 23 (11.89s). Network checks advised.
        2018-03-27T07:46:45Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 24 (13.03s). Network checks advised.
        2018-03-27T07:46:47Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 25 (14.25s). Network checks advised.
        2018-03-27T07:46:48Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 26 (15.57s). Network checks advised.
        2018-03-27T07:46:49Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 27 (16.99s). Network checks advised.
        2018-03-27T07:46:51Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 28 (18.51s). Network checks advised.
        2018-03-27T07:46:52Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 29 (20.14s). Network checks advised.
        2018-03-27T07:46:54Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 30 (21.9s). Network checks advised.
        2018-03-27T07:46:56Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 31 (23.8s). Network checks advised.
        2018-03-27T07:46:58Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 32 (25.83s). Network checks advised.
        2018-03-27T07:47:00Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.30:8531. Unsuccessful consecutive tries: 33 (28.01s). Network checks advised.
        2018-03-27T07:47:03Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 34 (30.36s). Network checks advised.
        2018-03-27T07:47:05Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.29:8531. Unsuccessful consecutive tries: 35 (32.89s). Network checks advised.
        2018-03-27T07:47:05Z [37769] INFO {agencycomm} Flaky agency communication to http+tcp://65.18.27.28:8531. Unsuccessful consecutive tries: 36 (32.89s). Network checks advised.
    2018/03/27 15:47:32 ## End of coordinator log
    2018/03/27 15:47:46 Failed to get master URL, retrying in 5sec (All 3 servers responded with temporary failure)

All the servers can ping each other, so we do not think it is a network problem.
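A successful ping only shows that ICMP gets through. The log lines above report failed communication with the agency over TCP port 8531, so it may be worth checking that port directly from the node being restarted. The following is a minimal sketch using only the Python standard library. The function name tcp_reachable and the 3-second timeout are illustrative assumptions, not part of ArangoDB. It only tells you whether a TCP connection can be opened, not why the agency is unresponsive.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; the timeout is an arbitrary choice.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_endpoints(endpoints):
    """Map each (host, port) pair to the result of tcp_reachable."""
    return {(host, port): tcp_reachable(host, port) for host, port in endpoints}
```

For example, check_endpoints([("65.18.27.28", 8531), ("65.18.27.29", 8531), ("65.18.27.30", 8531)]) would probe the three agent endpoints that appear in the log. If every endpoint is reachable, the agents are at least accepting connections, and the problem is more likely in the agency itself than in the network.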

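While I was writing this question and collecting the log output, the cluster restarted successfully, which is a bit strange. Now 2 nodes print the following log message: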

    updated cluster config does not contain myself. rejecting

Listing the collections now takes a long time, and the cluster does not work properly. Does anyone know why?

[Quoted from the GitHub discussion]

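Note that the option --cluster.agency-size 5 is only needed the first time you start the cluster. This is because the starter writes the cluster configuration on the first start, and it cannot be changed afterwards.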

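So in your case, if you need to add more agents on other nodes, you must use --cluster.start-agent true on each new node. If you want to make sure that your 5-node cluster keeps running when two (random) nodes are down, you need an agency size of 5.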

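The cluster cannot work if the agency is not up and running. The agency uses the RAFT protocol. If your agency consists of 3 agents, then the agency goes down as soon as two agents are down (and so does your cluster). If your agency consists of 5 agents, then the agency survives two agents going down (and so does your cluster).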
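The arithmetic behind these numbers is the usual majority rule for RAFT-style consensus: the agency stays available only while more than half of its agents can still reach each other. A minimal sketch of that calculation follows. The function names are illustrative and are not part of ArangoDB.

```python
def raft_quorum(agents: int) -> int:
    """Smallest majority among `agents` voting members."""
    if agents < 1:
        raise ValueError("an agency needs at least one agent")
    return agents // 2 + 1


def tolerated_failures(agents: int) -> int:
    """Number of agents that may be down while a majority remains."""
    return agents - raft_quorum(agents)


for size in (3, 5):
    print(
        f"agency size {size}: quorum {raft_quorum(size)}, "
        f"survives {tolerated_failures(size)} agent failure(s)"
    )
```

With 3 agents the quorum is 2, so only one agent may fail. With 5 agents the quorum is 3, so two may fail. This matches the behaviour described above.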

If you want to survive 3 machines going down, other setups are possible.

You could also consider using separate machines for the agency, for example:

  • 3 machines dedicated to the agency
  • plus 3 additional machines for DBServers+Coordinators (6 machines in total), with replication factor = 3

The setup above survives 1 agent going down plus 2 DBServers going down (so 3 machines down in total).
