Kafka-streams delay to kick off rebalancing on consumer graceful shutdown

Question

This is a follow-up to a previous question I posted about high latency in our Kafka Streams application (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).

As a quick reminder, our stateless service has very tight latency requirements, and we are seeing latency that is far too high (some messages are consumed more than 10 seconds after being produced), especially when a consumer leaves the group gracefully.

After further investigation we found that, at least for small consumer groups, the rebalance itself takes less than 500 ms. So we wondered: where does this huge latency (>10 s) when removing one consumer come from?

We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.

The previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application. We changed the configuration to:

properties.put("max.poll.records", 50); // defaults to 1000 in kafkastreams
properties.put("auto.offset.reset", "latest"); // defaults to latest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0);
properties.put("max.poll.interval.ms", 6000);

As a result, the time for the rebalance to start dropped to a bit more than 5 seconds.

We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance was exactly the same.

So we have some questions:

- We expected the rebalance to be triggered right away when a consumer stops gracefully. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the tradeoffs? More unneeded rebalances?

For more context, our Kafka version is 1.1.0: looking at the libs we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar, and we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka Streams 2.1.0.

Thank you!

Answer 1

Kafka Streams does not send a "leave group request" when an instance is shut down gracefully; this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (e.g., when upgrading an application, or when running in a Kubernetes environment where a pod is restarted quickly and automatically).

To achieve this, a non-public configuration is used. You can override the config via:

props.put("internal.leave.group.on.close", true); // Streams' default is `false`
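
For illustration, here is a minimal sketch of how that override might sit next to the timeout settings from the question. It only uses java.util.Properties with plain string keys so it runs without the Kafka jars; the class name, the application id, and the bootstrap address are hypothetical placeholders, and because internal.leave.group.on.close is a non-public setting, its name and behavior may differ between Kafka versions.

```java
import java.util.Properties;

public class LeaveGroupOnCloseExample {

    // Builds a Properties object for a Kafka Streams application.
    // applicationId and bootstrapServers are placeholders supplied by the caller.
    public static Properties buildProps(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);

        // Timeout settings taken from the question.
        props.put("heartbeat.interval.ms", 1000);
        props.put("session.timeout.ms", 6000);

        // Non-public config from the answer: ask the consumer to send a
        // leave-group request on close (Streams' default is false).
        props.put("internal.leave.group.on.close", true);
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical values, for demonstration only.
        Properties props = buildProps("latency-sensitive-app", "localhost:9092");
        System.out.println(props);
    }
}
```

In a real application these properties would be passed to the KafkaStreams constructor together with the topology. The tradeoff is the one described above: with the override enabled, every bounce of an instance triggers a rebalance immediately instead of only after the session timeout expires.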

Answer 1 (2019-02-01 17:32:28)
