Kafka keeps rebalancing consumers

We have 10 consumers in a group listening to one topic. Very often we see the consumers being rebalanced, which completely stops the consumer process for some time.

# ./kafka-consumer-groups.sh --describe --bootstrap-server localhost:9092  --describe  --group ParserKafkaPipeline | grep -e ParserBody | sort
ParserBodyToParse 0          99              99              0               consumer-1-f29b7eb7-b871-477c-af52-446fbf4b0496  /10.12.18.58    consumer-1
ParserBodyToParse 1          97              97              0               consumer-10-6639ee02-8e68-40e6-aca1-eabd89bf828e /10.12.18.58    consumer-10
ParserBodyToParse 2          97              97              0               consumer-11-c712db8b-0396-4388-9e3a-e8e342355547 /10.12.18.58    consumer-11
ParserBodyToParse 3          97              98              1               consumer-12-0cc6fe12-d640-4344-91c0-f15e63c20cca /10.12.18.58    consumer-12
ParserBodyToParse 4          97              98              1               consumer-13-b904a958-141d-412e-83ea-950cd51e25e0 /10.12.18.58    consumer-13
ParserBodyToParse 5          97              98              1               consumer-14-7c70ba88-8b8c-4fad-b15b-cf7692a4b9ce /10.12.18.58    consumer-14
ParserBodyToParse 6          98              98              0               consumer-15-f0983c3d-8704-4127-808d-ec8b6b847008 /10.12.18.58    consumer-15
ParserBodyToParse 7          97              97              0               consumer-18-de5d20dd-217c-4db2-9b39-e2fdbca386e9 /10.12.18.58    consumer-18
ParserBodyToParse 8          98              98              0               consumer-5-bdeaf30a-d2bf-4aec-86ea-9c35a7acfe21  /10.12.18.58    consumer-5
ParserBodyToParse 9          98              98              0               consumer-9-4de1bf17-9474-4bd4-ae61-4ab254f52863  /10.12.18.58    consumer-9

# ./kafka-consumer-groups.sh --describe --bootstrap-server localhost:9092  --describe  --group ParserKafkaPipeline | grep -e ParserBody | sort
Warning: Consumer group 'ParserKafkaPipeline' is rebalancing.
ParserBodyToParse 0          99              99              0               -               -               -
ParserBodyToParse 1          99              99              0               -               -               -
ParserBodyToParse 2          99              99              0               -               -               -
ParserBodyToParse 3          99              100             1               -               -               -
ParserBodyToParse 4          99              100             1               -               -               -
ParserBodyToParse 5          99              100             1               -               -               -
ParserBodyToParse 6          100             100             0               -               -               -
ParserBodyToParse 7          99              99              0               -               -               -
ParserBodyToParse 8          100             100             0               -               -               -
ParserBodyToParse 9          100             100             0               -               -               -

Notice the warning in the second call above.

Consuming these messages might take a long time, but it shouldn't take more than two minutes. I checked that the limit on consumer.poll is 5 minutes, which shouldn't be an issue. Are there any logs to check what exactly is happening?

UPDATE:

We use Kafka 2.2.1 and the Java consumer. We didn't change the default values of max.session and max.heartbeat. The consumer is basically waiting for IO from another service, so it is not using any CPU; that is why I expect the heartbeat to be working correctly.

Our consumer code is the following:

    inline fun <reified T : Any> consume(
            topic: KafkaTopic,
            groupId: String,
            batchSize: Int = 50,
            crossinline consume: (key: String?, value: T) -> (Unit)
    ) = thread {
        val consumerProperties = Properties()
        consumerProperties.putAll(properties)
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, batchSize)

        val consumer = KafkaConsumer<String?, ByteArray>(consumerProperties)

        consumer.subscribe(listOf(topic.toString()))

        while (true) try {
            val records = consumer.poll(Duration.ofMinutes(pollDurationMinutes))
            log.debug("Topic $topic consumed by group $groupId: ${records.count()} records.")
            records.forEach { record -> consumeRecord(record, topic, consume) }
        } catch (e: Exception) {
            log.fatal("Couldn't consume records: ${e.message}.", e)
            // sleep to prevent logging hell when connection failure
            Thread.sleep(1000)
        }
    }

Frequent rebalances are usually caused by the consumer taking too long to process batches. This happens because the consumer processes a batch for a long time (and heartbeats are not being sent), so the brokers think the consumer was lost and start rebalancing.

I would suggest either creating smaller batches by reducing the value of max.partition.fetch.bytes, or extending the heartbeat interval by increasing the value of heartbeat.interval.ms.
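
For illustration, a minimal sketch of how those settings could be added to the consumerProperties object in the code above; the numbers are arbitrary example values, not recommendations:

    // Illustrative values only; tune for your own workload.
    // A smaller per-partition fetch size gives poll() smaller batches to work through.
    consumerProperties.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 262144)  // default 1048576 (1 MB)
    // The heartbeat interval is conventionally kept to roughly a third of
    // session.timeout.ms, so raising it usually means raising the session timeout too.
    consumerProperties.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000)       // default 3000 ms
    consumerProperties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000)          // default 10000 ms in Kafka 2.2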

I think the first part of Giorgos' answer is correct, up to "...processing the batch for a long time", but the configuration advice is for a different problem.

There are two causes of a rebalance: too long between polls, or too long between heartbeats. The logs should tell you which one caused the rebalance, but it is usually the former.

If the problem is the heartbeat, then the advised configuration changes may help, and/or session.timeout.ms. The heartbeat runs in a separate thread and allows the group to quickly determine whether a consumer application has died.

If the problem is too long between polls and you can't speed up your processing, then you need to increase the allowed gap between calls to poll, or reduce the number of records you handle on each poll. The relevant properties are max.poll.interval.ms (default 5 minutes) and max.poll.records (default 500).
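
For example, a minimal sketch of those two settings applied to the consumerProperties object from the question's code; the values are chosen purely for illustration:

    // Illustrative values only: allow a bigger gap between poll() calls,
    // or hand the loop fewer records per poll so each batch finishes sooner.
    consumerProperties.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000)  // default 300000 ms (5 minutes)
    consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50)          // default 500; the question's code already sets this via batchSize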

For anyone who encounters this even when they feel that processing of records is not the bottleneck:

We recently encountered a nasty bug in the Kafka Connect Runtime which keeps heartbeat threads alive and spins up more Kafka Connect tasks with the same thread name (essentially, not killing the older task threads and heartbeat threads).

The following bugs were encountered in version 2.3.1 and a few other versions, as mentioned in the JIRAs:

https://issues.apache.org/jira/browse/KAFKA-9841

https://issues.apache.org/jira/browse/KAFKA-10574

https://issues.apache.org/jira/browse/KAFKA-9184

This also happened in Confluent Platform version 5.3.1, so please upgrade your Kafka Connect runtime and connect-api to the latest versions if possible.

In the end we abandoned Kafka and are now using Google Cloud Pub/Sub. It works without a single issue.
