Kafka python client alternating between assigned and rebalancing, not processing data
I have a Kafka topic with 40 partitions in a Kubernetes cluster, and a microservice that consumes from this topic.
Sometimes, within a batch process, a few partitions are left with unprocessed data while most partitions are already finished. Inspecting the group with kafka-consumer-groups.sh, it looks like this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
- - - - - kafka-python-2.0.1-f1259971-c8ed-4d98-ba37-40f263b14a78/10.44.2.119 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-328f6a97-22ea-4f59-b702-4173feb9f025/10.44.0.29 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-9a2ea04e-3bf1-40f4-9262-6c14d0791dfc/10.44.7.35 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-81f5be15-535c-436c-996e-f8098d0613a1/10.44.4.26 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-ffcf76e2-f0ed-4894-bc70-ee73220881db/10.44.14.2 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-fc5709a0-a0b5-4324-92ff-02b6ee0f1232/10.44.2.123 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-c058418c-51ec-43e2-b666-21971480665b/10.44.15.2 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-0c14afab-af2a-4668-bb3c-015932fbfd13/10.44.14.5 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-1cb308f0-203f-43ae-9252-e0fc98eb87b8/10.44.14.4 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-42753a7f-80d0-481e-93a6-67445cb1bb5e/10.44.14.6 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-63e97395-e1ec-4cab-8edc-c5dd251932af/10.44.2.122 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-7116fdc2-809f-4f99-b5bd-60fbf2aba935/10.44.1.37 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-f5ef8ff1-f09c-498e-9b27-1bcac94b895b/10.44.2.125 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-8feec117-aa3a-42c0-91e8-0ccefac5f134/10.44.2.121 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-45cc5605-d3c8-4c77-8ca8-88afbde81a69/10.44.14.3 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-9a575ac4-1531-4b2a-b516-12ffa2496615/10.44.5.32 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-d33e112b-a1f4-4699-8989-daee03a5021c/10.44.14.7 kafka-python-2.0.1
my-topic 20 890 890 0 - - -
my-topic 38 857 857 0 - - -
my-topic 28 918 918 0 - - -
my-topic 23 66 909 843 - - -
my-topic 10 888 888 0 - - -
my-topic 2 885 885 0 - - -
my-topic 7 853 853 0 - - -
my-topic 16 878 878 0 - - -
my-topic 15 47 901 854 - - -
my-topic 26 934 934 0 - - -
my-topic 32 898 898 0 - - -
my-topic 21 921 921 0 - - -
my-topic 13 933 933 0 - - -
my-topic 5 879 879 0 - - -
my-topic 12 945 945 0 - - -
my-topic 4 918 918 0 - - -
my-topic 29 924 924 0 - - -
my-topic 39 895 895 0 - - -
my-topic 25 30 926 896 - - -
my-topic 9 915 915 0 - - -
my-topic 35 31 890 859 - - -
my-topic 3 69 897 828 - - -
my-topic 1 911 911 0 - - -
my-topic 6 22 901 879 - - -
my-topic 14 41 881 840 - - -
my-topic 30 900 900 0 - - -
my-topic 22 847 847 0 - - -
my-topic 8 919 919 0 - - -
my-topic 0 902 902 0 - - -
my-topic 18 924 924 0 - - -
my-topic 36 864 864 0 - - -
my-topic 34 929 929 0 - - -
my-topic 24 864 864 0 - - -
my-topic 19 937 937 0 - - -
my-topic 27 859 859 0 - - -
my-topic 11 838 838 0 - - -
my-topic 31 49 922 873 - - -
my-topic 37 882 882 0 - - -
my-topic 17 942 942 0 - - -
my-topic 33 928 928 0 - - -
The tool further states that the consumer group is rebalancing. One thing to note here is that fewer consumers are listed under CONSUMER-ID than there should be: it should be 20 consumers, but only 17 are shown even though all pods are running. This number varies, and I am not sure whether it is an output issue or whether they are really gone. This also baffles me because it does not happen when I start fresh (all-new Kafka and consumer deployments). So it really seems to be related to consumer deployments being scaled, or otherwise killed.
For a short time the consumers get assigned, and after about half a minute the same picture as above shows again, with the consumer group rebalancing. This also happens when I scale down, e.g. to only 4 consumers. I am not sure what is happening here. The pods all run, and I use the same kind of base code and pattern in other microservices, where it seems to work fine.
I suspect it has something to do with a consumer pod getting killed because, as I said, a fresh deployment works initially. This batch job is also a bit more long-running than my others, so a pod kill is more likely during its run. I am also not sure whether it matters that most partitions are already finished; that could just be a quirk of my use case.
I noticed this because the processing seemed to take forever, yet new data was still being processed. So I think that in the brief moment when the consumers are assigned, they process data but never commit the offsets before being rebalanced, leaving them in an infinite loop. The only slightly related thing I found was this issue, but it is from quite a few versions back and does not fully describe my situation.
I use the kafka-python client and the Kafka image confluentinc/cp-kafka:5.0.1.
I create the topic using the admin client with NewTopic(name='my-topic', num_partitions=40, replication_factor=1) and create the consumer like so:
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(consume_topic,
                         bootstrap_servers=bootstrap_servers,
                         group_id=consume_group_id,
                         value_deserializer=lambda m: json.loads(m))
for message in consumer:
    process(message)
What is going wrong here? Do I have a configuration error? Any help is greatly appreciated.
The issue was with the heartbeat configuration. It turns out that while most messages take only seconds to process, a few take very long. In those special cases the heartbeat update took too long for some of the consumers, so the broker assumed the consumer was down and started a rebalance.
I assume what happened next is that the consumers got reassigned to the same message, took too long to process it again, and triggered yet another rebalance, resulting in an endless cycle.
I finally solved it by increasing both session_timeout_ms and heartbeat_interval_ms in the consumer (documented here). I also decreased the batch size so that the heartbeat is updated more regularly.