
Partition processing stuck until state store is rebuilt during rebalancing in Kafka Streams

Let's assume I have a stateful Kafka Streams application that consumes data from a topic with 3 partitions. At the moment I have 2 instances of this application running. Let's put it like this: instance1 has partitions part1 and part2 assigned, instance2 has part3 .

So now I want to add a new instance to fully utilize the parallelization.

In my understanding, as soon as I start a new instance, a rebalance occurs: one of the partitions part1 or part2 and the corresponding local state store will be migrated from an existing instance to the newly added instance. In this example, let's imagine that part1 migrates to instance3 .

At the same time, I realize that the new instance instance3 will not start processing new data until it has restored the local state store from the changelog topic, which may take a long time.

During the period from starting the new instance until it has restored the state store:

  • does it mean that the data from part1 is not processed and remains stuck until instance3 finishes starting up?
  • if yes, what are the approaches to estimate how much time it will take for instance3 to rebuild the local state store?
  • during this time, are the other instances unaffected by the rebalancing, and do they keep processing data with no downtime ( instance1 - part2 , instance2 - part3 )?

The rebalance upon adding a new instance happens at the consumer group level. This means that all the partitions assigned to all the consumers of the consumer group are revoked and then re-distributed. So all the partitions - part1, part2 and part3 - would be stuck until the rebalancing is complete.

Estimating the downtime is a bit tricky. You could emit events on the rebalance trigger and on consumption start, then compute the time difference between the two events to get an estimate of the downtime. If you have plain Java consumer logs, you can also get a rough estimate from them, since all the relevant log lines (partitions revoked as well as partitions assigned) are already there.
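For Kafka Streams specifically, the restore phase itself can be measured by attaching a StateRestoreListener to the KafkaStreams instance. A minimal sketch follows; the class name and the log format are just illustrative:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.streams.processor.StateRestoreListener;

    // Measures how long the restore of each state store partition takes.
    public class RestoreTimingListener implements StateRestoreListener {

        private final Map<TopicPartition, Long> startTimes = new ConcurrentHashMap<>();

        @Override
        public void onRestoreStart(TopicPartition partition, String storeName,
                                   long startingOffset, long endingOffset) {
            startTimes.put(partition, System.currentTimeMillis());
            System.out.printf("Restore of %s (%s) started, %d records to replay%n",
                    storeName, partition, endingOffset - startingOffset);
        }

        @Override
        public void onBatchRestored(TopicPartition partition, String storeName,
                                    long batchEndOffset, long numRestored) {
            // progress could be logged here as well
        }

        @Override
        public void onRestoreEnd(TopicPartition partition, String storeName,
                                 long totalRestored) {
            long start = startTimes.getOrDefault(partition, System.currentTimeMillis());
            System.out.printf("Restore of %s (%s) finished, %d records in %d ms%n",
                    storeName, partition, totalRestored, System.currentTimeMillis() - start);
        }
    }

    // Usage (before starting the instance):
    //   KafkaStreams streams = new KafkaStreams(topology, props);
    //   streams.setGlobalStateRestoreListener(new RestoreTimingListener());
    //   streams.start();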

Rebalancing has evolved with the recent releases:

from version 2.4.0 with KIP-429

  • incremental cooperative rebalancing was added, replacing the stop-the-world rebalancing protocol
  • optimized for the cloud in the sense of better rebalance behavior for members that drop out (e.g. when a Pod dies and restarts)
  • a consumer does not need to revoke a partition if the group coordinator reassigns the same partition to that consumer again

=> part2 and part3 are not stuck and continue to be processed
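Kafka Streams handles the cooperative protocol through its own assignor once all instances run 2.4+. For a plain Java consumer you would opt in yourself by configuring the CooperativeStickyAssignor; a minimal sketch, where the bootstrap server and group id are placeholders:

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CooperativeConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            // Opt in to incremental cooperative rebalancing (KIP-429); only partitions
            // that actually change owner are revoked, the rest keep being consumed.
            props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                    CooperativeStickyAssignor.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // subscribe and poll as usual
            }
        }
    }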

from version 2.6.0 with KIP-441

  • improves Kafka Streams scale-out behavior, especially for stateful tasks
  • previously some tasks were blocked from processing until the state store was rebuilt, which could take hours
  • now the new instance first catches the state store up from the changelog topic and only then takes over the task as active
  • no downtime during the scale-out

=> part1 continues to be processed on instance1 until instance3 has rebuilt the state store for part1 and is ready to take over its processing
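The hand-over behavior is driven by a few Streams configs related to KIP-441 (warm-up replicas, acceptable recovery lag, probing interval). The values in the sketch below are illustrative defaults, not tuning advice, and the application id and bootstrap server are placeholders:

    import java.util.Properties;

    import org.apache.kafka.streams.StreamsConfig;

    public class ScaleOutConfig {
        static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateful-app");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            // Max changelog lag at which a restoring task counts as "caught up"
            // and may take over as the active task.
            props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);

            // How many extra warm-up replicas may restore state in parallel
            // during a scale-out.
            props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);

            // How often the assignor checks whether warm-up replicas have caught
            // up so the tasks can be handed over.
            props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 10 * 60 * 1000L);

            // Optional: permanent standby replicas avoid a full restore on
            // fail-over in the first place.
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

            return props;
        }
    }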
