
Cassandra 3-node cluster gossip protocol issue due to network fluctuation

Original post: 2016-10-01 13:34:19. Tags: cassandra, datastax

I have a 3-node Cassandra 3.7 cluster with a keyspace replication factor of 3.

All 3 nodes are started and in sync. When one of the Cassandra nodes went down, I restarted it and the node got back in sync with the other nodes.

My question is about what happens when one of the nodes has problems such as frequent network fluctuations (Cassandra itself is still up and running). Say node 1 is having network issues: nodetool status on the other 2 nodes shows node 1 as down. When the network comes back on node 1, nodetool status on node 1 shows the other nodes as down.

Below are the configuration changes I made in the cassandra.yaml files.

Node 01

    cluster_name: 'Test Cluster'
    num_tokens: 256
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.4,10.1.1.5,10.1.1.6"
    listen_address: 10.1.1.4
    broadcast_address: 10.1.1.4
    rpc_address: 0.0.0.0
    broadcast_rpc_address: 10.1.1.4

Node 02

    cluster_name: 'Test Cluster'
    num_tokens: 256
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.4,10.1.1.5,10.1.1.6"
    listen_address: 10.1.1.5
    broadcast_address: 10.1.1.5
    rpc_address: 0.0.0.0
    broadcast_rpc_address: 10.1.1.5

Node 03

    cluster_name: 'Test Cluster'
    num_tokens: 256
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.4,10.1.1.5,10.1.1.6"
    listen_address: 10.1.1.6
    broadcast_address: 10.1.1.6
    rpc_address: 0.0.0.0
    broadcast_rpc_address: 10.1.1.6

nodetool status on node 1, once its network is back up, shows that the other nodes are down (DN).

nodetool status on the other nodes shows that node 1 is down (DN).

How does the gossip protocol work in this scenario?

Why is node 1 not in sync with the other nodes once the network is back up?

Please help me with this.

Thanks in advance,

GKK

2 Answers

If you are on AWS, make sure you are using the Ec2MultiRegionSnitch; otherwise, use whichever snitch is appropriate for your environment.

Also, if you are using AWS, the public IP address may differ from the private one, in which case you may need to modify the listen_address: and broadcast_address: values accordingly.
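As a rough sketch of what that could look like on node 1, assuming it keeps the private address 10.1.1.4 from the question and has a public address of 203.0.113.4 (a made-up placeholder from the documentation range, not a value from the original post):

```yaml
# cassandra.yaml on node 1 (illustrative values only)
endpoint_snitch: Ec2MultiRegionSnitch

# Private IP: the interface Cassandra binds to on this host
listen_address: 10.1.1.4

# Public IP (hypothetical placeholder): the address other nodes use to reach this one
broadcast_address: 203.0.113.4
```

Whether this split is needed depends on how the nodes reach each other; if all three sit in the same private network, keeping both values on the private IP, as in the question, may be fine.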

Finally, for the seeds value, I would omit each node's own IP from its seed list. In other words:

Node 01

    ...
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.5,10.1.1.6"
    listen_address: 10.1.1.4
    ...

Node 02

    ...
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.4,10.1.1.6"
    listen_address: 10.1.1.5
    ...

Node 03

    ...
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.1.4,10.1.1.5"
    listen_address: 10.1.1.6
    ...

Try to lower the number of seeds; 1 or 2 should be enough. By the way, use the same seed list on all nodes.
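For illustration, a shared two-seed list might look like the fragment below, placed unchanged in the cassandra.yaml of all three nodes. Picking 10.1.1.4 and 10.1.1.5 as the seeds is an arbitrary assumption; any two of the three addresses from the question would serve the same purpose.

```yaml
# Identical on Node 01, Node 02 and Node 03 (only listen_address etc. differ per node)
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # Two seeds, same list everywhere; the choice of nodes is illustrative
          - seeds: "10.1.1.4,10.1.1.5"
```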