
How does Flink consume messages from a Kafka topic with multiple partitions, without getting skewed?

Say we have 3 Kafka partitions for one topic, and I want my events to be windowed by the hour, using event time.

Will the Kafka consumer stop reading from a partition when that partition is outside of the current window? Or does it open a new window? If it opens new windows, then wouldn't it be theoretically possible for it to open an unlimited number of windows and thus run out of memory, if one partition's event time were heavily skewed compared to the others? This scenario seems especially likely when replaying historical data.

I have been trying to find the answer by reading the documentation, but cannot find much about the internals of Flink's interaction with Kafka partitions. Some good documentation on this specific topic would be very welcome.

Thanks!

You could try something along these lines, adapted from Flink's Kafka connector test suite (`KafkaConsumerTestBase`); the snippet below was truncated in the original, so the closing brace and a note about the omitted remainder have been added:

public void runStartFromLatestOffsets() throws Exception {
    // 50 records written to each of 3 partitions before launching a latest-starting consuming job
    final int parallelism = 3;
    final int recordsInEachPartition = 50;

    // each partition will be written an extra 200 records
    final int extraRecordsInEachPartition = 200;

    // all data already in the topic before the consuming topology starts should be ignored
    final String topicName = writeSequence("testStartFromLatestOffsetsTopic", recordsInEachPartition, parallelism, 1);

    // the committed offsets should be ignored
    KafkaTestEnvironment.KafkaOffsetHandler kafkaOffsetHandler = kafkaServer.createOffsetHandler();
    kafkaOffsetHandler.setCommittedOffset(topicName, 0, 23);
    kafkaOffsetHandler.setCommittedOffset(topicName, 1, 31);
    kafkaOffsetHandler.setCommittedOffset(topicName, 2, 43);

    // ... (the rest of the test builds and runs the consuming topology)
}

First of all, events from Kafka are read continuously, and the downstream windowing operations have no impact on that. There are more things to consider when talking about running out of memory:

  • usually you do not store every event for a window, but just some aggregate of the events
  • whenever a window is closed, the corresponding state is freed
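The two points above can be illustrated with a small, self-contained sketch (this is not Flink's actual implementation; the class and method names are made up for illustration). It keeps only a single running count per hourly window instead of buffering events, and frees a window's state once the watermark passes the window end:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of per-window aggregate state with watermark-driven cleanup.
public class WindowStateSketch {
    static final long HOUR_MS = 3600_000L;

    // window start timestamp -> running aggregate (here just an event count)
    final Map<Long, Long> windowCounts = new TreeMap<>();

    void onEvent(long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % HOUR_MS);
        // store an aggregate, not the event itself
        windowCounts.merge(windowStart, 1L, Long::sum);
    }

    // Fire and free every window whose end is at or before the watermark.
    void onWatermark(long watermarkMs) {
        windowCounts.entrySet().removeIf(e -> e.getKey() + HOUR_MS <= watermarkMs);
    }

    public static void main(String[] args) {
        WindowStateSketch s = new WindowStateSketch();
        s.onEvent(10 * HOUR_MS + 5);   // window [10:00, 11:00)
        s.onEvent(10 * HOUR_MS + 99);  // same window, aggregate grows to 2
        s.onEvent(11 * HOUR_MS + 1);   // window [11:00, 12:00)
        System.out.println("open windows before watermark: " + s.windowCounts.size());
        s.onWatermark(11 * HOUR_MS);   // closes the 10:00 window and frees its state
        System.out.println("open windows after watermark: " + s.windowCounts.size());
    }
}
```

So even if many windows are open at once during a replay, each one costs only the size of its aggregate, and closed windows release their state.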

For more on how the Kafka consumer interacts with event time (watermarks in particular), you can check the Flink documentation on Kafka connectors and timestamp/watermark assignment.

Maybe this will help you get some clarity:

https://github.com/apache/flink/blob/release-1.2/flink-connectors/flink-connector-kafka-base/src/main/java/org/apache/flink/streaming/connectors/kafka/partitioner/FixedPartitioner.java

A partitioner that ensures each internal Flink partition ends up in one Kafka partition.

Go through the fully commented source; it explains the various cases as well.
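The core of the linked `FixedPartitioner` boils down to a fixed mapping from a sink subtask's index to one Kafka partition. A rough sketch of that mapping (the method name here is hypothetical; see the linked file for the real class):

```java
// Illustrative sketch of the FixedPartitioner idea: each parallel Flink
// sink subtask always writes to the same Kafka partition, chosen from its
// subtask index.
public class FixedPartitionerSketch {
    static int targetPartition(int subtaskIndex, int[] kafkaPartitions) {
        // wrap around when there are more subtasks than partitions
        return kafkaPartitions[subtaskIndex % kafkaPartitions.length];
    }

    public static void main(String[] args) {
        int[] partitions = {0, 1, 2};
        for (int subtask = 0; subtask < 4; subtask++) {
            System.out.println("subtask " + subtask + " -> partition "
                    + targetPartition(subtask, partitions));
        }
    }
}
```

Note the consequences spelled out in the file's comments: with more subtasks than partitions, several subtasks share a partition, and with more partitions than subtasks, some partitions receive no data.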
