I have a Kafka topic with 40 partitions in a Kubernetes cluster, and a microservice that consumes from this topic.
Sometimes during a batch process a few partitions are left with unprocessed data while most partitions are already finished. Using kafka-consumer-groups.sh (e.g. `kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group>`), this looks like this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
- - - - - kafka-python-2.0.1-f1259971-c8ed-4d98-ba37-40f263b14a78/10.44.2.119 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-328f6a97-22ea-4f59-b702-4173feb9f025/10.44.0.29 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-9a2ea04e-3bf1-40f4-9262-6c14d0791dfc/10.44.7.35 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-81f5be15-535c-436c-996e-f8098d0613a1/10.44.4.26 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-ffcf76e2-f0ed-4894-bc70-ee73220881db/10.44.14.2 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-fc5709a0-a0b5-4324-92ff-02b6ee0f1232/10.44.2.123 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-c058418c-51ec-43e2-b666-21971480665b/10.44.15.2 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-0c14afab-af2a-4668-bb3c-015932fbfd13/10.44.14.5 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-1cb308f0-203f-43ae-9252-e0fc98eb87b8/10.44.14.4 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-42753a7f-80d0-481e-93a6-67445cb1bb5e/10.44.14.6 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-63e97395-e1ec-4cab-8edc-c5dd251932af/10.44.2.122 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-7116fdc2-809f-4f99-b5bd-60fbf2aba935/10.44.1.37 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-f5ef8ff1-f09c-498e-9b27-1bcac94b895b/10.44.2.125 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-8feec117-aa3a-42c0-91e8-0ccefac5f134/10.44.2.121 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-45cc5605-d3c8-4c77-8ca8-88afbde81a69/10.44.14.3 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-9a575ac4-1531-4b2a-b516-12ffa2496615/10.44.5.32 kafka-python-2.0.1
- - - - - kafka-python-2.0.1-d33e112b-a1f4-4699-8989-daee03a5021c/10.44.14.7 kafka-python-2.0.1
my-topic 20 890 890 0 - - -
my-topic 38 857 857 0 - - -
my-topic 28 918 918 0 - - -
my-topic 23 66 909 843 - - -
my-topic 10 888 888 0 - - -
my-topic 2 885 885 0 - - -
my-topic 7 853 853 0 - - -
my-topic 16 878 878 0 - - -
my-topic 15 47 901 854 - - -
my-topic 26 934 934 0 - - -
my-topic 32 898 898 0 - - -
my-topic 21 921 921 0 - - -
my-topic 13 933 933 0 - - -
my-topic 5 879 879 0 - - -
my-topic 12 945 945 0 - - -
my-topic 4 918 918 0 - - -
my-topic 29 924 924 0 - - -
my-topic 39 895 895 0 - - -
my-topic 25 30 926 896 - - -
my-topic 9 915 915 0 - - -
my-topic 35 31 890 859 - - -
my-topic 3 69 897 828 - - -
my-topic 1 911 911 0 - - -
my-topic 6 22 901 879 - - -
my-topic 14 41 881 840 - - -
my-topic 30 900 900 0 - - -
my-topic 22 847 847 0 - - -
my-topic 8 919 919 0 - - -
my-topic 0 902 902 0 - - -
my-topic 18 924 924 0 - - -
my-topic 36 864 864 0 - - -
my-topic 34 929 929 0 - - -
my-topic 24 864 864 0 - - -
my-topic 19 937 937 0 - - -
my-topic 27 859 859 0 - - -
my-topic 11 838 838 0 - - -
my-topic 31 49 922 873 - - -
my-topic 37 882 882 0 - - -
my-topic 17 942 942 0 - - -
my-topic 33 928 928 0 - - -
The tool also states that the consumer group is rebalancing. One thing to note is that fewer consumers are listed under CONSUMER-ID than there should be: there should be 20 consumers, but only 17 are shown, even though all pods are running. This number varies, and I am not sure whether it is an output issue or whether they are really gone. It also baffles me that this does not happen when I start fresh (all-new Kafka and consumer deployments). So it really seems to be related to consumer deployments being scaled or otherwise killed.
For a short time the consumers then get assigned, and after about half a minute the same picture as above appears again, with the consumer group rebalancing.
This also happens when I scale down, e.g. to only 4 consumers. I am not sure what is happening here. The pods all run, and I use the same base code and pattern in other microservices, where it works fine.
I suspect it has something to do with a consumer pod getting killed because, as I said, it works initially with a fresh deployment. This batch job is also somewhat longer-running than my others, so a pod kill during its run is more likely. I am also not sure whether it matters that most partitions are already finished; that could just be a quirk of my use case.
I noticed this because the processing seemed to take forever even though new data was still being processed. So I think that in the brief moment when the consumers are assigned, they process data but never commit their offsets before being rebalanced, leaving them in an infinite loop. The only slightly related thing I found was this issue, but it is from quite a few versions ago and does not fully describe my situation.
I use the kafka-python client and the Kafka image confluentinc/cp-kafka:5.0.1.
I create the topic using the admin client with NewTopic(name='my-topic', num_partitions=40, replication_factor=1) and create the consumer like so:
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(consume_topic,
                         bootstrap_servers=bootstrap_servers,
                         group_id=consume_group_id,
                         value_deserializer=lambda m: json.loads(m))
for message in consumer:
    process(message)
What is going wrong here? Do I have some configuration error?
Any help is greatly appreciated.
The issue was with the heartbeat configuration. It turns out that while most messages only need seconds to process, a few take very long. In those cases the heartbeat update took too long for some of the consumers, causing the broker to assume the consumer was down and to start a rebalance.
I assume what happened next is that the consumers were reassigned the same messages, took too long to process them again, and triggered yet another rebalance, resulting in an endless cycle.
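The suspected cycle can be sketched as a toy simulation (all timings are made up for illustration; this is not kafka-python code, just the logic of the failure mode):

```python
# Toy model of the suspected failure loop: a consumer must finish a message
# within the session timeout, or the broker evicts it, the group rebalances,
# and the uncommitted message is redelivered, restarting the cycle.
def attempts_until_committed(processing_ms, session_timeout_ms, max_attempts=10):
    """Count deliveries of one slow message; None models an endless loop."""
    for attempt in range(1, max_attempts + 1):
        if processing_ms <= session_timeout_ms:
            return attempt  # processed and committed in time
        # Heartbeat deadline missed -> rebalance; the offset was never
        # committed, so the same message is delivered again next round.
    return None

print(attempts_until_committed(2_000, 10_000))    # fast message: 1
print(attempts_until_committed(45_000, 10_000))   # slow message: None (stuck)
print(attempts_until_committed(45_000, 120_000))  # raised timeout: 1
```

With the default-sized timeout the slow message never commits; raising the timeout lets it complete on the first delivery.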
I finally solved it by increasing both session_timeout_ms and heartbeat_interval_ms in the consumer (see the kafka-python KafkaConsumer documentation). I also decreased the batch size so that the heartbeat is updated more regularly.
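A sketch of the tuned settings might look like this (the values are illustrative, not the exact ones used; kafka-python's defaults are session_timeout_ms=10000, heartbeat_interval_ms=3000, max_poll_records=500):

```python
import json

# Hypothetical tuned consumer settings; the convention is to keep
# heartbeat_interval_ms at no more than about 1/3 of session_timeout_ms.
tuned_config = {
    "session_timeout_ms": 120_000,    # give slow messages more time before eviction
    "heartbeat_interval_ms": 40_000,  # well below the session timeout
    "max_poll_records": 50,           # smaller batches -> more frequent poll() calls
    "value_deserializer": lambda m: json.loads(m),
}

# Usage sketch (topic, bootstrap servers, and group id as in the question):
# consumer = KafkaConsumer(consume_topic,
#                          bootstrap_servers=bootstrap_servers,
#                          group_id=consume_group_id,
#                          **tuned_config)
```

Note that in newer kafka-python versions heartbeats are sent from a background thread, so long per-message processing can also run into max_poll_interval_ms; lowering the batch size, as done here, helps on that front as well.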