
Couchbase Cluster: one node down => entire cluster down?

I'm testing Couchbase Server 2.5. I have a cluster with 7 nodes and 3 replicas. Under normal conditions the system works fine.

But this test case fails: while the Couchbase cluster is serving 40,000 ops, I stop the couchbase service on one server, so one node goes down. After that, the performance of the entire cluster drops painfully; it can only serve fewer than 1,000 ops. When I click fail-over, the entire cluster returns to healthy.

I thought that when a node goes down, only a fraction of the requests would be affected. Is that right?

And in reality, when one node goes down, does it have a big impact on the entire cluster?

Updated:

I wrote a load-testing tool using spymemcached. The tool creates multiple threads that connect to the Couchbase cluster. Each thread Sets a key and immediately Gets it back to check it; on success it moves on and Sets/Gets another key. On failure it retries the Set/Get, and skips the key after 5 failed attempts. A minimal sketch of such a worker is shown below.
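For illustration only, here is a minimal sketch of that kind of worker, assuming the Couchbase Java client 1.x (built on spymemcached). The class name, node address, bucket name, key prefix, payload, thread count and key count are all illustrative assumptions, not the original tool:

import java.net.URI;
import java.util.Arrays;
import com.couchbase.client.CouchbaseClient;

public class SetGetLoadTest implements Runnable {

    private final CouchbaseClient client;
    private final int keysPerThread;

    SetGetLoadTest(CouchbaseClient client, int keysPerThread) {
        this.client = client;
        this.keysPerThread = keysPerThread;
    }

    @Override
    public void run() {
        for (int i = 0; i < keysPerThread; i++) {
            String key = "test_key_" + Thread.currentThread().getId() + "_" + i;
            for (int attempt = 0; attempt < 5; attempt++) {       // skip the key after 5 failures
                try {
                    client.set(key, 0, "800-byte-payload").get(); // synchronous Set
                    if (client.get(key) != null) {                // immediate Get to verify
                        break;                                    // success: move to the next key
                    }
                } catch (Exception e) {
                    // timeouts / cancelled operations land here when a node is down
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://10.0.0.21:8091/pools")), "default", "");
        for (int t = 0; t < 8; t++) {
            new Thread(new SetGetLoadTest(client, 100000)).start();
        }
    }
}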

This is the log for a key whose Set/Get failed:

2014-04-16 16:22:20.405 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.23:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-04-16 16:22:20.405 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.23:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-04-16 16:22:20.406 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800
2014-04-16 16:22:20.406 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 2660830 Key: test_key_2681412 Cancelled
2014-04-16 16:22:20.407 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl: Error: Internal error
2014-04-16 16:22:20.407 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-04-16 16:22:20.407 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-04-16 16:22:20.408 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800
2014-04-16 16:22:20.408 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 2660832 Key: test_key_2681412 Cancelled

You should find that 6/7 (i.e. roughly 85%) of your operations continue to run at the same performance. However, the remaining ~15% of operations, which are directed at the vBuckets owned by the now-down node, will never complete and will likely time out, so depending on how your application handles these timeouts you may see a greater performance drop overall.
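One way to bound that impact is to keep the per-operation timeout short and treat a timed-out key as temporarily unavailable instead of retrying it synchronously in a tight loop. A hedged sketch, assuming the Couchbase Java client 1.x connection factory builder is available (the timeout value, node address and key are illustrative assumptions):

import java.net.URI;
import java.util.Arrays;
import net.spy.memcached.OperationTimeoutException;
import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactory;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;

public class TimeoutAwareGet {
    public static void main(String[] args) throws Exception {
        // Keep the per-operation timeout short so requests routed to the down
        // node fail fast instead of stalling the calling thread for seconds.
        CouchbaseConnectionFactoryBuilder builder = new CouchbaseConnectionFactoryBuilder();
        builder.setOpTimeout(2500); // milliseconds; illustrative value
        CouchbaseConnectionFactory cf = builder.buildCouchbaseConnection(
                Arrays.asList(URI.create("http://10.0.0.21:8091/pools")), "default", "");
        CouchbaseClient client = new CouchbaseClient(cf);

        try {
            Object value = client.get("test_key_2681412"); // null if the key does not exist
        } catch (OperationTimeoutException e) {
            // The vBucket owning this key lives on the down node: record the
            // failure and move on rather than blocking on repeated retries.
        }
        client.shutdown();
    }
}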

How are you benchmarking / measuring the performance?

Update: OP's extra details

I wrote a load-testing tool using spymemcached. The tool creates multiple threads that connect to the Couchbase cluster. Each thread Sets a key and immediately Gets it back to check it; on success it continues to Set/Get another key. On failure it retries the Set/Get, and skips the key after 5 failed attempts.

The Java SDK is designed to make use of async operations for maximum performance, and this is particularly true when the cluster is degraded and some operations will time out. I'd suggest starting by running in a single thread but using Futures to handle the get after the set. For example:

client.set("key", document).addListener(new OperationCompletionListener() {
    @Override
    public void onComplete(OperationFuture<?> future) throws Exception {
        System.out.println("I'm done!");    
    }
});

This is an extract from the Understanding and Using Asynchronous Operations section of the Java Developer guide.
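Along the same lines, the immediate read-back in the test tool can itself be chained asynchronously inside the set listener. A sketch under the assumption that client is a final CouchbaseClient reference and that the GetCompletionListener / GetFuture listener API from the same SDK version is available (key and value names are illustrative):

client.set("key", document).addListener(new OperationCompletionListener() {
    @Override
    public void onComplete(OperationFuture<?> future) throws Exception {
        // Only issue the Get once the Set has completed, still without blocking.
        client.asyncGet("key").addListener(new GetCompletionListener() {
            @Override
            public void onComplete(GetFuture<?> getFuture) throws Exception {
                System.out.println("Read back: " + getFuture.get());
            }
        });
    }
});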

Given the right code, there's essentially no reason why your performance with 85% of the nodes up shouldn't stay close to 85% of the maximum during a short downtime.

Note that if a node is down for a long time, the replication queues on the other nodes will start to back up, and that can impact performance. Hence the recommendation to use auto-failover / rebalance to get back to 100% active buckets and to re-create replicas, so that any further node failures don't cause data loss.
