
Cassandra Kubernetes StatefulSet NoHostAvailableException

I have an application deployed in Kubernetes; it consists of Cassandra, a Go client, and a Java client (and other things, but they are not relevant for this discussion). We have used Helm to do our deployment. We are using a StatefulSet and a headless service for Cassandra, and we have configured the clients to use the headless service DNS as a contact point for cluster creation.
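For context, here is a minimal sketch of how a Java client can be pointed at the headless service (the service name cassandra.default.svc.cluster.local and keyspace-less connect below are illustrative assumptions, not our actual values); the driver only uses the name to bootstrap and then discovers the rest of the ring:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CassandraClient {
        public static void main(String[] args) {
            // The headless service DNS name resolves to the pod IPs of the
            // StatefulSet members. "cassandra.default.svc.cluster.local" is a
            // hypothetical name; substitute whatever the Helm chart creates.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("cassandra.default.svc.cluster.local")
                    .withPort(9042)
                    .build();
            Session session = cluster.connect();
            System.out.println("Connected to " + cluster.getMetadata().getClusterName());
            session.close();
            cluster.close();
        }
    }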

Everything works great, until all of the nodes go down (or some other nefarious combination of nodes goes down). I am simulating this by deleting all pods, running kubectl delete in succession on all of the Cassandra nodes.

When I do this, the clients throw exceptions. In Java it is a NoHostAvailableException:

    "java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
    which eventually becomes
    "java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
in go its
   "gocql: no hosts available in the pool"

I can query Cassandra using cqlsh, and the nodes seem fine according to nodetool status; all of the new IPs are there. The image I am using doesn't have netstat, so I have not yet confirmed it is listening on the expected port.

By executing bash on the two client pods I can see that the DNS makes sense using nslookup, but netstat does not show any established connections to Cassandra (they are present before I take the nodes down).
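To dig further, one thing I can try is registering a host state listener with the Java driver (a debugging sketch, assuming the 3.x driver that the exception classes above come from) to log what the driver itself thinks is happening as pods are deleted and replaced:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;

    // Logs every host state change the driver observes; this shows whether the
    // replacement pods' IPs are ever added back to the driver's metadata, or
    // whether it is stuck retrying the old, now-stale addresses.
    public class LoggingHostListener implements Host.StateListener {
        @Override public void onAdd(Host host)    { System.out.println("ADD    " + host); }
        @Override public void onUp(Host host)     { System.out.println("UP     " + host); }
        @Override public void onDown(Host host)   { System.out.println("DOWN   " + host); }
        @Override public void onRemove(Host host) { System.out.println("REMOVE " + host); }
        @Override public void onRegister(Cluster cluster)   { }
        @Override public void onUnregister(Cluster cluster) { }
    }

Registering it with cluster.register(new LoggingHostListener()) before connecting should show whether the driver ever learns about the new pod IPs.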

If I restart my clients everything works fine.

I have googled a lot (I mean a lot); most of what I have found relates to never having had a working connection in the first place, and the most relevant things seem very old (like 2014, 2016).

A node going down is a very basic scenario, so I would expect everything to work: the Cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, and so on.

If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works).

So, is there a point where this behaviour is expected? That is, I took everything down, and nothing had come back up and running before the last node from the original cluster was taken down. Is this behaviour expected?

To me it seems like it should be an easy issue to resolve, but I am not sure what is missing or incorrect. I am surprised that both clients show the same symptoms, which makes me think something is not happening as it should with our StatefulSet and service.

I think the problem might lie in the headless DNS service. If all of the nodes go down completely, and no nodes at all are available via the service until the pods are replaced, it could cause the driver to hang.
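One thing worth checking is the reconnection policy on the client side. As a sketch (assuming the 3.x Java driver, to match the exception classes in your question), an exponential reconnection policy keeps the driver retrying downed hosts; note, though, that it retries the addresses the driver already knows, so if every replacement pod comes up with a brand-new IP those retries can never succeed, which would match the symptoms you describe:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

    public class ReconnectExample {
        public static void main(String[] args) {
            // Retry downed hosts starting at 1 s, backing off to at most 60 s.
            // The policy only retries IPs already in the driver's metadata, so
            // it cannot rediscover a cluster whose pods all changed address.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("cassandra.default.svc.cluster.local") // hypothetical name
                    .withReconnectionPolicy(new ExponentialReconnectionPolicy(1000, 60000))
                    .build();
            cluster.init();
        }
    }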

I've noted that you've used Helm for your deployments, but you may be interested in this document from the authors of the cass-operator on connecting to Cassandra clusters in Kubernetes.

I'm going to contact some of the authors and get them to respond here. Cheers!
