
How to detect a Kafka Streams app in zombie state

One of our Kafka Streams application's StreamThread consumers entered a zombie state after producing the following log message:

[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
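For reference, both settings mentioned in that log line can be overridden through the properties passed to the KafkaStreams constructor; a minimal sketch with purely illustrative values (the application id is taken from the log above, the bootstrap server is a placeholder):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsPollConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "notification-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Give the poll loop more time before the consumer is considered failed...
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        // ...and/or hand the topology fewer records per poll loop iteration.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        return props;
    }
}
```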

It seems the StreamThread's Kafka Consumer has left the consumer group, but the Kafka Streams app remained in the RUNNING state while not consuming any new records.

I would like to detect that a Kafka Streams app has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams app is in the RUNNING or REBALANCING state, but that is not working for this case.
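That check boils down to roughly the following state-based predicate (a sketch; the HTTP endpoint that Kubernetes actually probes is omitted):

```java
import org.apache.kafka.streams.KafkaStreams;

public final class StreamsHealthCheck {
    private final KafkaStreams streams;

    public StreamsHealthCheck(KafkaStreams streams) {
        this.streams = streams;
    }

    /** True while the client reports a healthy state; wired into the Kubernetes probe. */
    public boolean isHealthy() {
        KafkaStreams.State state = streams.state();
        return state == KafkaStreams.State.RUNNING || state == KafkaStreams.State.REBALANCING;
    }
}
```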

Therefore I have two questions:

  1. Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
  2. How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?

Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?

It depends on the version. In older versions (2.1.x and older), Kafka Streams would indeed stay in the RUNNING state even if all threads died. This issue is fixed in v2.2.0 via https://issues.apache.org/jira/browse/KAFKA-7657 .
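A sketch of how that fix can be surfaced to a health check via a state listener on 2.2.0+ (the wrapper class and probe wiring are made up for illustration; the listener must be registered before start()):

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.streams.KafkaStreams;

public final class ZombieStateGuard {
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    /** Register before streams.start(); flags the instance once the client hits a terminal state. */
    public void register(KafkaStreams streams) {
        streams.setStateListener((newState, oldState) -> {
            if (newState == KafkaStreams.State.ERROR || newState == KafkaStreams.State.NOT_RUNNING) {
                healthy.set(false); // let the Kubernetes probe fail so the instance is replaced
            }
        });
    }

    public boolean isHealthy() {
        return healthy.get();
    }
}
```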

How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?

Even in older versions, you can register an uncaught exception handler on the KafkaStreams client. This handler is invoked each time a StreamThread dies.
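A sketch using the handler API available in those older (pre-2.8) versions; the dead-thread counter and threshold are just one illustrative way to turn it into a health signal, and the handler has to be registered before KafkaStreams#start():

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.kafka.streams.KafkaStreams;

public final class DeadThreadCounter {
    private final AtomicInteger deadThreads = new AtomicInteger();

    /** Pre-2.8 handler API: invoked with the dying StreamThread and the fatal exception. */
    public void register(KafkaStreams streams, int numStreamThreads) {
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            int dead = deadThreads.incrementAndGet();
            if (dead >= numStreamThreads) {
                // all stream threads are gone: flag the instance as unhealthy / shut it down
            }
        });
    }
}
```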

Btw: in the upcoming 2.6.0 release, a new metric alive-stream-threads is added to track the number of running threads: https://issues.apache.org/jira/browse/KAFKA-9753
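A sketch of reading that metric through KafkaStreams#metrics() on 2.6.0+ (it can equally be scraped via JMX; the lookup assumes the metric's value is numeric):

```java
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public final class AliveThreadsMetric {
    /** Returns the value of the client-level alive-stream-threads metric, or -1 if absent. */
    public static double aliveStreamThreads(KafkaStreams streams) {
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            if ("alive-stream-threads".equals(entry.getKey().name())) {
                return ((Number) entry.getValue().metricValue()).doubleValue();
            }
        }
        return -1;
    }
}
```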

FYI there's a similar discussion going on right now on the user mailing list -- subject line "kafka stream zombie state"

I'll start by telling you guys what I said there, since there seem to be some misconceptions in the conversation so far: basically, the error message is kind of misleading because it implies this is being logged by the consumer itself and that it is currently sending this LeaveGroup / has already noticed that it missed the poll interval. But this message is actually getting logged by the heartbeat thread when it notices that the main consumer thread hasn't polled within the max poll timeout, and it is technically just marking it as "needs to rejoin" so that the consumer knows to send this LeaveGroup when it finally does poll again. However, if the consumer thread is actually stuck somewhere in the user/application code and cannot break out to continue the poll loop, then the consumer will never actually trigger a rebalance, try to rejoin, send a LeaveGroup request, etc. So that's why the state continues to be RUNNING rather than REBALANCING.

For the above reason, metrics like num-alive-stream-threads won't help either, since the thread isn't dying -- it's just stuck. In fact, even if the thread became unstuck, it would just rejoin and then continue as usual; it wouldn't "die" (since that only occurs when encountering a fatal exception).

Long story short: the broker and heartbeat thread have noticed that the consumer is no longer in the group, but the StreamThread is likely stuck somewhere in the topology, and thus the consumer itself actually has no idea that it's been kicked out of the consumer group.
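One hedged idea for catching that situation from the outside: the consumer's own poll-gap metrics (e.g. last-poll-seconds-ago, added by KIP-517 in Kafka 2.4) keep growing while the thread is wedged, so a liveness check could compare them against a threshold. A rough sketch, assuming those consumer metrics are reachable via KafkaStreams#metrics() and that the threshold is tuned relative to your max.poll.interval.ms:

```java
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public final class StuckConsumerCheck {
    /**
     * Returns true if any embedded consumer reports that it has not called poll() for longer
     * than maxSecondsSinceLastPoll. Assumes the "last-poll-seconds-ago" consumer metric
     * (KIP-517, Kafka 2.4+) is exposed through KafkaStreams#metrics().
     */
    public static boolean consumerLooksStuck(KafkaStreams streams, double maxSecondsSinceLastPoll) {
        for (Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            if ("last-poll-seconds-ago".equals(entry.getKey().name())) {
                double secondsAgo = ((Number) entry.getValue().metricValue()).doubleValue();
                if (secondsAgo > maxSecondsSinceLastPoll) {
                    return true; // main consumer hasn't polled recently: likely stuck in user code
                }
            }
        }
        return false;
    }
}
```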
