
Multiple Kafka Producers writing to the same topic - how to load balance consumption

So I have a design where I have multiple producers P1, P2, P3, P4... PN writing to a single topic T1, that has 32 partitions.

On the other side I have up to 32 consumers in a single consumer group.

I would like to load balance my message consumption.

Reading the docs I could see 3 options:
1. Define the partition myself (drawback: I would have to know where the last message was sent, or define a partition range for each producer P).
2. Define a key and leave the partition decision to the Kafka hash algorithm (drawback: load balancing would come down to luck).

(As per Chris's answer, the load balancing should be left to the hash algorithm.) In practice this does not give the consumers an equal distribution, because consumers are bound to partitions. I would have to understand the hash algorithm to choose a good key, which to me sounds the same as picking the partition myself (and that choice would have to be coordinated across the producers).

My current code is using UUID as the key. The analysis of the partitions chosen, and consequently the consumers working, shows a distribution that may be far from being equal. I'm reproducing it below:

[Image: messages received per partition]

The image above shows the number of messages received by each partition in a 5-minute window using a UUID as my key. At that point in time I had 8 consumers. Consuming a message takes about 2 minutes. The cells in red show a queue of 9 requests in one of the consumers, while other consumers had low loads, or zero load like the consumer in green. If a random key is not a good option, what should I choose?
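To illustrate how much of this skew can come from chance alone, here is a minimal simulation sketch. It makes several assumptions that are not from the measurements above: `partition_for` uses an MD5-based stand-in rather than Kafka's actual key hash, each consumer is assumed to own a contiguous block of 4 partitions, and the message counts are hypothetical.

```python
import hashlib
import random
import uuid

NUM_PARTITIONS = 32
NUM_CONSUMERS = 8


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash for illustration only; this is NOT Kafka's own key hash.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def simulate(num_messages: int, seed: int = 0) -> list[int]:
    # Count how many messages land on each partition when keys are random UUIDs.
    rng = random.Random(seed)
    counts = [0] * NUM_PARTITIONS
    for _ in range(num_messages):
        key = str(uuid.UUID(int=rng.getrandbits(128)))
        counts[partition_for(key)] += 1
    return counts


def consumer_loads(counts: list[int]) -> list[int]:
    # Assumed assignment: consumer c owns a contiguous block of partitions.
    per_consumer = NUM_PARTITIONS // NUM_CONSUMERS
    return [
        sum(counts[c * per_consumer:(c + 1) * per_consumer])
        for c in range(NUM_CONSUMERS)
    ]


if __name__ == "__main__":
    for n in (100, 100_000):  # hypothetical message volumes
        counts = simulate(n)
        loads = consumer_loads(counts)
        print(f"{n} messages: partition min/max = {min(counts)}/{max(counts)}, "
              f"consumer min/max = {min(loads)}/{max(loads)}")
```

With only a few messages per partition, the relative gap between the busiest and the idlest partition tends to be large even for a well-behaved hash, and it shrinks as the volume grows. When each message takes minutes to process, a short window with few messages can therefore look badly unbalanced.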

3. No partition, no key, and leave it to the Kafka round-robin algorithm (drawback: the round robin is internal to each producer, meaning all producers could be sending their messages to the same partition). I also tested this option and the result is below:

[Image: round robin is internal to the producer]

The image above shows that round robin is, apparently, internal to each producer.
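The effect described above can be reproduced with a toy model. This is a sketch under stated assumptions, not the Kafka client's partitioner: `RoundRobinProducer` is a hypothetical class in which every producer keeps its own private counter, and the producer count and messages per producer are made up for illustration.

```python
import random

NUM_PARTITIONS = 32


class RoundRobinProducer:
    """Toy producer that cycles through partitions using its own private counter."""

    def __init__(self, start: int = 0):
        self._next = start

    def send(self) -> int:
        # Pick the next partition in this producer's own cycle.
        partition = self._next % NUM_PARTITIONS
        self._next += 1
        return partition


def run(producers, messages_each: int) -> list[int]:
    # Total messages per partition after every producer sends messages_each messages.
    counts = [0] * NUM_PARTITIONS
    for producer in producers:
        for _ in range(messages_each):
            counts[producer.send()] += 1
    return counts


if __name__ == "__main__":
    # All producers start their cycle at partition 0 and send only 3 messages each.
    same_start = run([RoundRobinProducer(0) for _ in range(4)], 3)
    print("same start:  ", same_start)

    # Each producer starts its cycle at a random partition instead.
    rng = random.Random(1)
    starts = [rng.randrange(NUM_PARTITIONS) for _ in range(4)]
    random_start = run([RoundRobinProducer(s) for s in starts], 3)
    print("random start:", random_start)
```

In this model, producers that each send fewer messages than there are partitions and all begin their cycle at the same place pile up on the same few partitions. Randomising the starting point spreads them out without any coordination between producers.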

Do I really need to write the overall load balancing algorithm myself? Am I missing something?

Balancing load across consumers is one of the defining features of Kafka that allows horizontal scaling.

The record key used by the producer is what allows this to work. The key determines which partition the message goes to, and each partition is consumed sequentially by one consumer in the group. Your producers should therefore use a key strategy that produces an even spread and, if ordering is important, gives related messages the same key (bear in mind there are other considerations around in-flight requests if strict ordering is critical).

The even spread of keys is what balances the load. There is no round robin involved on the consumer side: partitions are simply shared out as evenly as possible among the consumers in each group, and the consumers poll independently. If keys are well spread, each partition will hold about the same number of records.

So, to enable effective load balancing your only responsibility is to use a good strategy for creating message keys, and define your topics with at least as many partitions as you plan to scale out consumption to.
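As a rough illustration of how partitions get shared out within a group, and why the partition count caps how far consumption can scale, here is a minimal sketch. The `assign` function is a hypothetical helper that splits one topic's partitions into contiguous, near-equal blocks; it is only meant to mirror the idea, and the strategy a real consumer group uses depends on its configured assignor.

```python
def assign(num_partitions: int, num_consumers: int) -> list[list[int]]:
    # Split partitions 0..num_partitions-1 into contiguous blocks whose
    # sizes differ by at most one; the first `extra` consumers get one more.
    base, extra = divmod(num_partitions, num_consumers)
    assignment = []
    start = 0
    for c in range(num_consumers):
        size = base + (1 if c < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


if __name__ == "__main__":
    for consumers in (8, 5, 40):
        sizes = [len(parts) for parts in assign(32, consumers)]
        print(f"32 partitions, {consumers} consumers -> partitions per consumer: {sizes}")
```

With 32 partitions, 8 consumers get 4 partitions each, while 5 consumers get 7, 7, 6, 6 and 6. With 40 consumers, 8 of them receive nothing, which is why the topic needs at least as many partitions as the largest consumer count you plan for.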
