Cassandra hangs on arbitrary commands

Asked 2014-07-29 15:58:04. Tags: cassandra, cql, phpcassa

We're hosting a Cassandra 2.0.2 cluster on AWS. We recently started upgrading from normal drives to SSDs by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now that the 6 new nodes are operational, we noticed that some of our old tools that use phpcassa have stopped working. Nothing has changed in the security groups, all TCP/UDP ports are open, telnet can connect on 9160, and cqlsh can 'use' a cluster and select data. However, 'describe cluster' fails, and in the CLI 'show keyspaces' also fails. By "fails" I mean the command never returns to the prompt and never returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot run them. The production system, which also uses phpcassa, issues normal data requests and works fine.
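The telnet check mentioned above can be repeated across nodes with a short script. This is only a minimal sketch using Python's standard `socket` module; the helper name `port_open` and the timeout value are assumptions for illustration. Note that a successful connect only shows that the port accepts TCP connections. It says nothing about whether a request sent afterwards will get an answer, which is the symptom described here.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This mirrors "telnet host 9160": it checks reachability only and
    does not send any Thrift or CQL request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example use (the address is a placeholder, not taken from the post):
#   port_open("10.0.0.1", 9160)
```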

All Cassandra nodes have the same config and the same version, and were installed from the same package. All nodes were recently restarted because of a seed node change.

Versions:

I've run out of ideas. Any hints would be greatly appreciated.

Update:

After a bit of random investigating, here's a more detailed description.

If I cassandra-cli to any machine, and do "show keyspaces", it works.

If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.

If I cqlsh to a remote Cassandra and do a describe keyspaces, it hangs. If I press Ctrl+C and repeat the same query, it responds instantly.

If I cqlsh to a local Cassandra and do a describe keyspaces, it works.

If I cqlsh to a local Cassandra and do a select * from Keyspace limit x, it returns data only up to a certain limit. I was able to return data with limit 760; limit 761 would fail.
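A threshold like the 760/761 boundary above can be located with a binary search instead of trial and error. The sketch below is hypothetical: `largest_working_limit` and the `probe` callable are names invented for illustration, and `probe(n)` is assumed to run the query with `limit n` under a client-side timeout and report whether it returned. It also assumes that once a limit fails, every larger limit fails too.

```python
from typing import Callable


def largest_working_limit(probe: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the largest n in [lo, hi] for which probe(n) is True.

    Assumes probe(lo) succeeds and that failures are monotonic:
    if probe(n) fails, probe(m) fails for every m > n.
    """
    if not probe(lo):
        raise ValueError("probe fails even at the lower bound")
    while lo < hi:
        # Round up so the loop always makes progress when hi == lo + 1.
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid       # mid works, so the answer is mid or larger
        else:
            hi = mid - 1   # mid fails, so the answer is below mid
    return lo


# Stand-in probe that imitates the behaviour reported in the post.
print(largest_working_limit(lambda n: n <= 760, 1, 10_000))  # 760
```

With a real probe, each call would issue one query, so the search needs roughly 14 queries to cover a range of 10,000 limits.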

If I set consistency all and run the same select, it hangs.

If I run a trace, different machines return the data, though sometimes source_elapsed is "null".

Also worth noting: applications querying the cluster sometimes do get results after several attempts.

Update 2

Further experimenting led to failed bootstrapping of two nodes: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other simply failed after 2 days. Repairs wouldn't work and produced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". Also, after running repair, we started getting "Read an invalid frame size of 0. Are you using tframedtransport on the client side?", so..

Solution

Switch rpc_server_type from hsha to sync. All problems gone. We had worked with hsha for months without issues.

In case someone else stumbles on this: http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/
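To check which nodes still run with hsha, the `rpc_server_type` line in each node's `cassandra.yaml` can be inspected with a small script. This is a rough sketch, not a full YAML parser: the two helper names are made up for illustration, and the regular expressions only look at an uncommented top-level `rpc_server_type:` line. Any change to the file presumably takes effect only after the node is restarted.

```python
import re
from typing import Optional

# Matches an uncommented, top-level rpc_server_type line.
_KEY_RE = re.compile(r"^rpc_server_type:\s*(\S+).*$", flags=re.MULTILINE)


def get_rpc_server_type(yaml_text: str) -> Optional[str]:
    """Return the configured rpc_server_type, or None if the key is absent."""
    match = _KEY_RE.search(yaml_text)
    return match.group(1) if match else None


def set_rpc_server_type(yaml_text: str, value: str) -> str:
    """Return yaml_text with the rpc_server_type line rewritten to value."""
    return _KEY_RE.sub(lambda _m: f"rpc_server_type: {value}", yaml_text)


# Illustrative fragment, not a complete cassandra.yaml.
sample = "rpc_port: 9160\nrpc_server_type: hsha\n"
print(get_rpc_server_type(sample))                               # hsha
print(get_rpc_server_type(set_rpc_server_type(sample, "sync")))  # sync
```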