
How does a Kafka stream get distributed among TaskManagers in Flink?

Say a Flink job (three task managers tm1, tm2 and tm3) consumes a Kafka topic as a source. How does the stream get distributed among them, and who does the distribution?

This is done in FlinkKafkaConsumerBase, in its open() method. The Flink runtime context provides methods that each instance can use to determine the total number of parallel instances of the Flink Kafka consumer, as well as its own index among them. Each instance uses these two values to decide independently which partitions it is responsible for reading, so no central coordinator hands out the partitions.
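To make the idea concrete, here is a minimal, self-contained sketch of how each parallel instance could work out its own partitions from nothing but its index and the total parallelism. It is a simplified illustration, not Flink's actual code: the class and method names (PartitionAssignmentSketch, assign, partitionsFor) and the topic name "events" are hypothetical, and the hash-based start offset is an assumption about how a per-topic starting point might be chosen. In a real job the two inputs would come from the runtime context (getIndexOfThisSubtask() and getNumberOfParallelSubtasks()).

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentSketch {

    // Maps one partition to the index of the subtask that should read it.
    // The topic hash only picks a starting offset; the partitions of a topic
    // are then spread round-robin over the subtasks.
    static int assign(String topic, int partition, int numSubtasks) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numSubtasks;
        return (startIndex + partition) % numSubtasks;
    }

    // What a single subtask would compute for itself: the partitions it owns.
    // Every subtask runs the same deterministic function, so the results
    // never overlap and together cover all partitions.
    static List<Integer> partitionsFor(String topic, int numPartitions,
                                       int subtaskIndex, int numSubtasks) {
        List<Integer> mine = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            if (assign(topic, p, numSubtasks) == subtaskIndex) {
                mine.add(p);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        int numSubtasks = 3; // e.g. one source subtask on each of tm1, tm2, tm3
        for (int i = 0; i < numSubtasks; i++) {
            System.out.println("subtask " + i + " reads partitions "
                    + partitionsFor("events", 6, i, numSubtasks));
        }
    }
}
```

With 6 partitions and 3 subtasks, each subtask ends up with exactly 2 partitions. Which two depends on the topic's hash, so the sketch makes no claim about the specific partition numbers.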

Adding to what David wrote, you should keep one thing in mind: the maximum useful parallelism of a Kafka consumer is limited by the number of partitions, since any parallel instance beyond that has nothing to read. Because Flink distributes the tasks starting with the first slot (the first task manager), then the second, and so on, and repeats this for each source, you might see an unbalanced workload if you have more task managers than topic partitions.
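The partition limit can be checked with a bit of arithmetic. The sketch below is a hypothetical helper (IdleSubtaskSketch and countIdleSubtasks are not Flink names) that assumes a plain round-robin assignment and ignores any per-topic start offset. It counts how many source subtasks would receive no partition at all.

```java
public class IdleSubtaskSketch {

    // Counts the source subtasks that get no partition when partitions are
    // handed out round-robin (assumed scheme, start offset ignored).
    static int countIdleSubtasks(int numPartitions, int parallelism) {
        boolean[] busy = new boolean[parallelism];
        for (int p = 0; p < numPartitions; p++) {
            busy[p % parallelism] = true;
        }
        int idle = 0;
        for (boolean b : busy) {
            if (!b) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // 2 partitions read with parallelism 5: three subtasks have nothing to do.
        System.out.println(countIdleSubtasks(2, 5)); // prints 3
        // 6 partitions read with parallelism 3: every subtask has work.
        System.out.println(countIdleSubtasks(6, 3)); // prints 0
    }
}
```

Whenever the parallelism exceeds the partition count, the difference shows up as idle subtasks, which is why raising the source parallelism past the number of partitions does not add throughput.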

In a scenario where you have many Kafka sources, each with a small number of topic partitions, this imbalance becomes more and more visible. In the extreme case, where you have many sources with only one partition each, all of these sources will be consumed by the first slot/task manager. You can work around this by using slot sharing groups. This is of course an edge case, but it is worth keeping in mind when you define your resources and workflows.


 