How to detect a Kafka Streams app in zombie state

One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:

[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

It seems the StreamThread's Kafka Consumer has left the consumer group, but the Kafka Streams App remained in a RUNNING state while not consuming any new records.

I would like to detect that a Kafka Streams App has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams App is in a RUNNING or REBALANCING state, but that does not work in this case.

Therefore I have two questions:

  1. Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
  2. How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?

Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?

It depends on the version. In older versions (2.1.x and earlier), Kafka Streams would indeed stay in the RUNNING state even if all threads had died. This was fixed in v2.2.0 via https://issues.apache.org/jira/browse/KAFKA-7657 .

How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer?

Even in older versions, you can register an uncaught exception handler on the KafkaStreams client. This handler is invoked each time a StreamThread dies.
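A minimal sketch of such a handler, using only JDK types. The class name DeadThreadTracker and its methods are hypothetical, not part of Kafka. It assumes the Kafka Streams versions discussed here, where KafkaStreams#setUncaughtExceptionHandler accepts a plain java.lang.Thread.UncaughtExceptionHandler; newer releases offer a different, Streams-specific handler type.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: counts stream threads that died from an uncaught exception. */
class DeadThreadTracker implements Thread.UncaughtExceptionHandler {
    private final int totalThreads;                       // assumed to equal num.stream.threads
    private final AtomicInteger deadThreads = new AtomicInteger();

    DeadThreadTracker(int totalThreads) {
        this.totalThreads = totalThreads;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        // Invoked on the dying thread, once per thread that terminates abnormally.
        deadThreads.incrementAndGet();
        System.err.println("Thread " + thread.getName() + " died: " + error);
    }

    /** True while no thread has died; a health check could fail as soon as this flips. */
    boolean isHealthy() {
        return deadThreads.get() == 0;
    }

    /** True once every configured thread has died. */
    boolean allDead() {
        return deadThreads.get() >= totalThreads;
    }
}
```

The tracker would be passed to setUncaughtExceptionHandler before calling start(), and the Kubernetes health endpoint could then report isHealthy() alongside the client state. Note that it only reacts to threads that die, not to threads that are merely blocked.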

Btw: In the upcoming 2.6.0 release, a new metric, alive-stream-threads, is added to track the number of running threads: https://issues.apache.org/jira/browse/KAFKA-9753
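One way to read that metric from inside the application is through JMX, sketched below with JDK classes only. The class name AliveThreadsProbe is hypothetical, and the MBean name pattern kafka.streams:type=stream-metrics,client-id=* is an assumption about how the client-level metrics are exposed; verify it against your version with a JMX browser before relying on it.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalDouble;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper: looks up alive-stream-threads via JMX, if it is exposed. */
class AliveThreadsProbe {

    static OptionalDouble aliveStreamThreads(MBeanServer server) {
        try {
            // Assumed MBean naming for client-level Kafka Streams metrics.
            ObjectName pattern =
                new ObjectName("kafka.streams:type=stream-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                Object value = server.getAttribute(name, "alive-stream-threads");
                if (value instanceof Number) {
                    // Returns the value of the first matching client only.
                    return OptionalDouble.of(((Number) value).doubleValue());
                }
            }
        } catch (JMException e) {
            // Attribute or MBean missing (e.g. a version without this metric): treat as unknown.
        }
        return OptionalDouble.empty();
    }

    static OptionalDouble aliveStreamThreads() {
        return aliveStreamThreads(ManagementFactory.getPlatformMBeanServer());
    }
}
```

An empty result means the metric could not be found, which a health check should probably treat as "unknown" rather than as zero threads.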

FYI there's a similar discussion going on right now on the user mailing list -- subject line "kafka stream zombie state"

I'll start by repeating what I said there, since there seem to be some misconceptions in the conversation so far. Basically, the error message is kind of misleading: it implies that it is being logged by the consumer itself, and that the consumer has already noticed it missed the poll interval and is currently sending the LeaveGroup. But this message is actually logged by the heartbeat thread when it notices that the main consumer thread hasn't polled within max.poll.interval.ms, and it is technically just marking the consumer as "needs to rejoin" so that the consumer knows to send the LeaveGroup when it finally does poll again. However, if the consumer thread is actually stuck somewhere in the user/application code and cannot break out to continue the poll loop, then the consumer will never actually trigger a rebalance, try to rejoin, send a LeaveGroup request, etc. That's why the state continues to be RUNNING rather than REBALANCING.

For the above reason, metrics like alive-stream-threads won't help either, since the thread isn't dying -- it's just stuck. In fact, even if the thread became unstuck, it would just rejoin and then continue as usual; it wouldn't "die" (since that only occurs when encountering a fatal exception).

Long story short: the broker and heartbeat thread have noticed that the consumer is no longer in the group, but the StreamThread is likely stuck somewhere in the topology, and thus the consumer itself actually has no idea that it's been kicked out of the consumer group.
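Since neither the client state nor a thread-death signal changes for a stuck thread, one possible workaround (not something Kafka Streams provides) is an application-level progress watchdog: the stream thread records a timestamp whenever it makes progress, and the health check fails once that timestamp is too old. A minimal sketch follows; the class name ProgressWatchdog, its methods, and the injected clock are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper: flags a thread that has not reported progress recently. */
class ProgressWatchdog {
    private final long maxSilenceNanos;
    private final LongSupplier nanoClock;          // e.g. System::nanoTime in production
    private final AtomicLong lastProgressNanos;

    ProgressWatchdog(Duration maxSilence, LongSupplier nanoClock) {
        this.maxSilenceNanos = maxSilence.toNanos();
        this.nanoClock = nanoClock;
        this.lastProgressNanos = new AtomicLong(nanoClock.getAsLong());
    }

    /** Called from the stream thread whenever it gets through its processing loop. */
    void recordProgress() {
        lastProgressNanos.set(nanoClock.getAsLong());
    }

    /** Called from the health-check thread; true if no progress was seen for too long. */
    boolean isStalled() {
        return nanoClock.getAsLong() - lastProgressNanos.get() > maxSilenceNanos;
    }
}
```

A threshold somewhat above max.poll.interval.ms seems a reasonable starting point. Where recordProgress() is called matters: if it is only called while processing records, an idle input topic looks identical to a stuck thread, so calling it from something that runs on the stream thread regardless of traffic (for example a wall-clock punctuation, assuming your topology can host one) is likely the safer choice.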
