mesos-master crash with zookeeper cluster

Question

I am deploying a ZooKeeper cluster with 3 nodes, and I use it to keep my Mesos masters highly available. I downloaded the zookeeper-3.4.6.tar.gz tarball, uncompressed it to /opt, renamed it to /opt/zookeeper, entered the directory, edited conf/zoo.cfg (pasted below), created a myid file in dataDir (which is set to /var/lib/zookeeper in zoo.cfg), and started ZooKeeper using ./bin/zkServer.sh start, and it went well. I started all 3 nodes one by one and they all seem fine. I can connect to the server with ./bin/zkCli.sh without problems.

But when I start Mesos (3 masters and 3 slaves, each node runs a master and a slave), the masters soon crash, one by one, and the Slaves tab of the web page at http://mesos_master:5050 shows no slaves. When I run only one ZooKeeper server, everything is fine. So I think the problem is in the ZooKeeper cluster.

I have 3 PV hosts on my Ubuntu server, all running Ubuntu 14.04 LTS: node-01, node-02, node-03. /etc/hosts on all three nodes looks like this:

172.16.2.70     node-01
172.16.2.81     node-02
172.16.2.80     node-03

I installed ZooKeeper and Mesos on all three nodes. The ZooKeeper configuration file is as follows (identical on all three nodes):

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=node-01:2888:3888
server.2=node-02:2888:3888
server.3=node-03:2888:3888

The ZooKeeper servers start normally and run well. I then start the mesos-master service with the command line ./bin/mesos-master.sh --zk=zk://172.16.2.70:2181,172.16.2.81:2181,172.16.2.80:2181/mesos --work_dir=/var/lib/mesos --quorum=2 , and after a few seconds it gives me errors like this:

F0817 15:09:19.995256  2250 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7fa2b8be71a2  google::LogMessage::Fail()
    @     0x7fa2b8be70ee  google::LogMessage::SendToLog()
    @     0x7fa2b8be6af0  google::LogMessage::Flush()
    @     0x7fa2b8be9a04  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa2b81a899a  mesos::internal::master::fail()
    @     0x7fa2b8262f8f  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
    @     0x7fa2b823fba7  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
    @     0x7fa2b820f9f3  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
    @     0x7fa2b826305c  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x4a44e7  std::function<>::operator()()
    @           0x49f3a7  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
    @           0x499480  process::Future<>::fail()
    @     0x7fa2b806b4b4  process::Promise<>::fail()
    @     0x7fa2b826011b  process::internal::thenf<>()
    @     0x7fa2b82a0757  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7fa2b82962d9  std::_Bind<>::operator()<>()
    @     0x7fa2b827ee89  std::_Function_handler<>::_M_invoke()
I0817 15:09:20.098639  2248 http.cpp:283] HTTP GET for /master/state.json from 172.16.2.84:54542 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
    @     0x7fa2b8296507  std::function<>::operator()()
    @     0x7fa2b827efaf  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
    @     0x7fa2b82a07fe  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
    @     0x7fa2b8296507  std::function<>::operator()()
    @     0x7fa2b82e4419  process::internal::run<>()
    @     0x7fa2b82da22a  process::Future<>::fail()
    @     0x7fa2b83136b5  std::_Mem_fn<>::operator()<>()
    @     0x7fa2b830efdf  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7fa2b8307d7f  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
    @     0x7fa2b82fe431  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
    @     0x7fa2b830f065  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
    @           0x4a44e7  std::function<>::operator()()
    @           0x49f3a7  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
    @     0x7fa2b82da202  process::Future<>::fail()
    @     0x7fa2b82d2d82  process::Promise<>::fail()
Aborted

Sometimes the following message appears first, and then the master crashes with the same output as above:

0817 15:09:49.745750  2104 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

I want to know whether ZooKeeper is deployed and running correctly in my case, and how I can locate the problem. Any answers and suggestions are welcome. Thanks.
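One way to see whether the three ZooKeeper servers have actually formed an ensemble, rather than merely started, is to ask each one for its role using ZooKeeper's four-letter command srvr on the client port. The following is a minimal sketch under that assumption; the function names are made up for illustration and are not part of ZooKeeper or Mesos.

```python
import socket


def four_letter_word(host, port, command, timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Return the value of the 'Mode:' line of a srvr reply, or None if absent."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_modes(hosts, port=2181):
    """Map each host to its reported mode, or to an 'unreachable' note."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(four_letter_word(host, port, "srvr"))
        except OSError as exc:  # refused, timed out, or name not resolvable
            modes[host] = "unreachable: %s" % exc
    return modes
```

Calling ensemble_modes(["node-01", "node-02", "node-03"]) on a healthy three-node ensemble should report one leader and two followers. A value of None means the server answered but did not report a mode, which typically happens when it is not currently serving requests.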

Answer 1

Actually, in my case, it was because I had not opened firewall port 5050 to allow the three servers to communicate with each other. After updating the firewall rule, it started to work as expected.

Answer 2

I ran into the same issue. I tried different approaches and different options, and finally the --ip option worked for me. Initially I had used the --hostname option.

mesos-master --ip=192.168.0.13 --quorum=2 --zk=zk://m1:2181,m2:2181,m3:2181/mesos --work_dir=/opt/mm1 --log_dir=/opt/mm1/logs

Answer 3

You need to check that all Mesos/ZooKeeper master nodes can communicate correctly. For that, you need:

ZooKeeper ports open: TCP 2181, 2888, 3888
Mesos port open: TCP 5050
ping available (ICMP message types 0 and 8)
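The TCP part of this checklist can be scripted. Below is a minimal sketch that tries a plain TCP connect to each listed port; the host names come from the question, and the helper names are hypothetical. It does not cover the ICMP check.

```python
import socket

# Ports from the checklist above. Note: as far as I know, 2888 is only
# listened on by the current ZooKeeper leader, so a refused connection to
# 2888 on a follower is expected and not a firewall problem by itself.
REQUIRED_TCP_PORTS = {
    2181: "ZooKeeper client",
    2888: "ZooKeeper quorum (leader only)",
    3888: "ZooKeeper leader election",
    5050: "Mesos master",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_ports(hosts, ports=REQUIRED_TCP_PORTS, probe=port_is_open):
    """Return (host, port) pairs for which the probe failed."""
    return [
        (host, port)
        for host in hosts
        for port in sorted(ports)
        if not probe(host, port)
    ]
```

Running unreachable_ports(["node-01", "node-02", "node-03"]) from each of the three nodes in turn shows which direction of traffic is blocked, since a firewall rule may allow one direction and not the other.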

If you use FQDNs instead of IP addresses in your config, check that DNS resolution is working correctly as well.
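Name resolution can be checked the same way. The sketch below resolves each name and compares the result with the address you expect; the expected addresses in the usage note are the ones from the /etc/hosts file in the question, and the function names are illustrative only.

```python
import socket


def resolve_all(names):
    """Map each name to the sorted IPv4 addresses it resolves to (empty on failure)."""
    result = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
            result[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            result[name] = []
    return result


def mismatches(resolved, expected):
    """Return the names whose resolved addresses do not include the expected one."""
    return sorted(
        name for name, ip in expected.items() if ip not in resolved.get(name, [])
    )
```

For the setup in the question, expected would be {"node-01": "172.16.2.70", "node-02": "172.16.2.81", "node-03": "172.16.2.80"}, and mismatches(resolve_all(expected), expected) should return an empty list on every node. A name that resolves to a loopback address on its own machine is worth a closer look, because other masters cannot reach that address.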

Answer 4

Give each of your Mesos masters its own work_dir; do not use a shared work_dir for all masters, because of zk

4 answers

Answer 1
1 2015-10-09 08:11:16

Answer 2
1 2016-06-01 12:35:44

Answer 3
0 2016-09-08 00:51:56

Answer 4
-1 2016-02-18 03:50:25
