简体   繁体   中英

Kafka-streams delay to kick rebalancing on consumer graceful shutdown

This is a follow up on a previous question I sent regarding high latency in our Kafka Streams; ( Kafka Streams rebalancing latency spikes on high throughput kafka-streams services ).

As a quick reminder, our Stateless service has very tight latency requirements and we are facing too high latency problems (some messages consumed more than 10 secs after being produced) specially when a consumer leaves gracefully the group.

After further investigation we have found out that at least for small consumer groups the rebalance is taking less than 500ms. So we thought, where is this huge latency when removing one consumer (>10s) coming from?

We realized that it is the time between the consumer exiting Gracefully and the rebalance kicking in.

That previous tests were executed with all-default configurations in both Kafka and Kafka Streams application. We changed the configurations to:

properties.put("max.poll.records", 50); // defaults to 1000 in kafkastreams
properties.put("auto.offset.reset", "latest"); // defaults to latest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0);
properties.put("max.poll.interval.ms", 6000);

And the result is that the time for the rebalance to start dropped to a bit more than 5 secs.

We also tested to kill a consumer non-gracefully by 'kill -9' it; the result is that the time to trigger the rebalance is exactly the same.

So we have some questions: - We expected that when the consumer is stopping gracefully the rebalance is triggered right away, should that be the expected behavior? why isn't it happening in our tests? - How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? what are the tradeoffs? more unneeded rebalances?

For more context, our Kafka version is 1.1.0, after looking at libs found for example kafka/kafka_2.11-1.1.0-cp1.jar, we installed Confluent platform 4.1.0. On the consumer side, we are using Kafka-streams 2.1.0.

Thank you!

Kafka Streams does not sent a "leave group request" when an instance is shut down gracefully -- this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (eg, if one upgrades an application; or if one runs in a Kubernetes environment and a POD is restarted quickly automatically).

To achieve this, a non public configuration is used. You can overwrite the config via

props.put("internal.leave.group.on.close", true); // Streams' default is `false`

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM