
Apache Flink: How to close a fixed-size window when data is not received for a certain period of time

I am trying to calculate the rate of incoming events per minute from a Kafka topic based on event time, using a TumblingEventTimeWindows of 1 minute. The code snippet is given below.

I have observed that if no events arrive for a particular window, e.g. from 2.34 to 2.35, then the previous window of 2.33 to 2.34 does not get closed. I understand the risk of losing data for the window of 2.33 to 2.34 (which may happen due to system failure, a large Kafka lag, etc.), but I cannot wait indefinitely. I need to close this window after waiting for a certain period of time, and subsequent windows can continue after the system recovers. How can I achieve this?

I am trying the following code, which gives the event count per minute for a continuous flow of events.

    StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
    executionEnvironment.setRestartStrategy(RestartStrategies.fixedDelayRestart(
            3,
            org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)
    ));
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    executionEnvironment.setParallelism(1);
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "localhost:9092");
    properties.setProperty("group.id", "AllEventCountConsumerGroup");
    FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("event_input_topic", new SimpleStringSchema(), properties);
    DataStreamSource<String> kafkaDataStream = executionEnvironment.addSource(kafkaConsumer);
    kafkaDataStream
            .flatMap(new EventFlatter())
            .filter(Objects::nonNull)
            .assignTimestampsAndWatermarks(WatermarkStrategy
                    .<Entity>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                    .withTimestampAssigner((SerializableTimestampAssigner<Entity>) (element, recordTimestamp) -> element.getTimestamp()))
            .keyBy((KeySelector<Entity, String>) Entity::getTenant)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .allowedLateness(Time.seconds(10))
            .aggregate(new EventCountAggregator())
            .addSink(eventRateProducer);

Given forBoundedOutOfOrderness(Duration.ofSeconds(2)), a window for the interval [t, t + 1 minute) won't close until after an event with timestamp >= t + 1 minute + 2 seconds is processed.
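For concreteness, this arithmetic can be written down as a tiny helper (the class and method names are mine, for illustration only):

```java
import java.time.Duration;

// With bounded out-of-orderness, the watermark trails the largest seen
// timestamp, so a window over [start, start + size) can only fire after
// an event with timestamp >= start + size + outOfOrderness arrives.
class WindowCloseMath {
    static long earliestClosingEventTimestamp(long windowStartMs,
                                              Duration windowSize,
                                              Duration outOfOrderness) {
        long windowEndMs = windowStartMs + windowSize.toMillis();
        return windowEndMs + outOfOrderness.toMillis();
    }
}
```

For a window starting at t = 0 with a 1-minute size and 2 seconds of out-of-orderness, the first event that can close it carries a timestamp of at least 62,000 ms.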

If your input stream can have long periods of idleness, and you can't wait until the stream resumes, then you'll have to either artificially advance the watermark after detecting idleness, or use a custom window Trigger that uses a combination of both event-time and processing-time timers.
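The second option can be modeled like this. Below is a minimal, Flink-free sketch of the firing decision such a custom Trigger would make (the class is hypothetical; a real implementation would extend org.apache.flink.streaming.api.windowing.triggers.Trigger and register both timers through the TriggerContext):

```java
// Hypothetical model of a Trigger that fires on whichever comes first:
// the event-time timer at the window end, or a processing-time timeout
// measured since the last element was seen.
final class TimeoutTriggerLogic {
    private final long windowEndMs;     // exclusive end of the window, event time
    private final long idleTimeoutMs;   // processing-time timeout
    private long lastElementProcTimeMs;

    TimeoutTriggerLogic(long windowEndMs, long idleTimeoutMs, long nowProcTimeMs) {
        this.windowEndMs = windowEndMs;
        this.idleTimeoutMs = idleTimeoutMs;
        this.lastElementProcTimeMs = nowProcTimeMs;
    }

    // Called for every element assigned to the window.
    void onElement(long nowProcTimeMs) {
        lastElementProcTimeMs = nowProcTimeMs;
    }

    // Flink registers the event-time timer at windowEnd - 1 (the window's
    // maxTimestamp), so the window fires once the watermark reaches that;
    // otherwise the processing-time timeout kicks in.
    boolean shouldFire(long currentWatermarkMs, long nowProcTimeMs) {
        boolean eventTimeFired = currentWatermarkMs >= windowEndMs - 1;
        boolean timedOut = nowProcTimeMs - lastElementProcTimeMs >= idleTimeoutMs;
        return eventTimeFired || timedOut;
    }
}
```

Note that recent Flink versions (1.12+, if I recall correctly) ship a ProcessingTimeoutTrigger that wraps an existing trigger with exactly this kind of processing-time fallback, so it is worth checking before writing your own.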

For a watermark generator that detects idleness, here's an example, but it hasn't been updated to the new WatermarkStrategy API.
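The general pattern looks like the standalone sketch below (the class is hypothetical; a real implementation would implement org.apache.flink.api.common.eventtime.WatermarkGenerator and emit from onPeriodicEmit). The watermark normally trails the largest timestamp seen by the out-of-orderness bound, but once no event has arrived for the idle timeout, it starts advancing with processing time instead, so pending windows eventually close:

```java
// Hypothetical idleness-aware watermark logic: behaves like
// forBoundedOutOfOrderness while events flow, and falls back to
// advancing with processing time once the stream has been idle.
final class IdleAwareWatermarks {
    private final long outOfOrdernessMs;
    private final long idleTimeoutMs;
    private long maxTimestampMs = Long.MIN_VALUE;
    private long lastEventProcTimeMs;

    IdleAwareWatermarks(long outOfOrdernessMs, long idleTimeoutMs, long nowProcTimeMs) {
        this.outOfOrdernessMs = outOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
        this.lastEventProcTimeMs = nowProcTimeMs;
    }

    // Equivalent of WatermarkGenerator#onEvent.
    void onEvent(long eventTimestampMs, long nowProcTimeMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimestampMs);
        lastEventProcTimeMs = nowProcTimeMs;
    }

    // Equivalent of WatermarkGenerator#onPeriodicEmit.
    long currentWatermark(long nowProcTimeMs) {
        long normal = maxTimestampMs - outOfOrdernessMs - 1;
        long idleForMs = nowProcTimeMs - lastEventProcTimeMs;
        if (idleForMs >= idleTimeoutMs) {
            // Pretend event time advanced by however long we have been
            // idle beyond the timeout, so downstream windows can close.
            long advanced = maxTimestampMs + (idleForMs - idleTimeoutMs) - 1;
            return Math.max(normal, advanced);
        }
        return normal;
    }
}
```

Be aware that WatermarkStrategy#withIdleness is not a substitute here: as far as I know it only marks an idle stream so it stops holding back watermarks from other inputs; it does not advance the watermark of the idle stream itself, so a single idle source still leaves its windows open.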
