
Force eviction of sliding event windows for processing (historical streams) on Flink

Currently, I am using Flink to conduct research on stream processing engines. For my study, I work with historical streams, which consist of tuples of the following form:

event_time, attribute_1, ..., attribute_X

where event_time is used as TimeCharacteristic.EventTime during processing. Furthermore, I push my datasets into the processing topology either by (i) creating in-memory structures, or (ii) reading the CSV files themselves.

Unfortunately, I have noticed that even when enough tuples have arrived at a window operator to complete a full window, that window is not pushed downstream for processing. As a result, performance drops significantly, and with large historical streams I often get an OutOfMemoryError.

To illustrate a typical use-case, I present the following example:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.createLocalEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
env.setMaxParallelism(1);
List<Tuple2<Long, Integer>> l = new ArrayList<>();
l.add(new Tuple2<>(1L, 11));
l.add(new Tuple2<>(2L, 22));
l.add(new Tuple2<>(3L, 33));
l.add(new Tuple2<>(4L, 44));
l.add(new Tuple2<>(5L, 55));
DataStream<Tuple2<Long, Integer>> stream = env.fromCollection(l);
stream.assignTimestampsAndWatermarks(
    new AscendingTimestampExtractor<Tuple2<Long, Integer>>() {
        @Override
        public long extractAscendingTimestamp(Tuple2<Long, Integer> t) {
            return t.f0;
        }
    })
    .windowAll(SlidingEventTimeWindows.of(Time.milliseconds(2),
            Time.milliseconds(1)))
    .sum(1)
    .print();
env.execute();

According to l's contents, I expect the following windowed results:

  • [0, 2) Sum: 11
  • [1, 3) Sum: 33
  • [2, 4) Sum: 55
  • [3, 5) Sum: 77
  • [4, 6) Sum: 99
  • [5, 7) Sum: 55

Each list item can be read as [start-timestamp, end-timestamp), Sum: X.

I expect Flink to produce a windowed result every time a tuple appears whose timestamp is beyond the end-timestamp of an open window. For instance, I expect the sum for window [1, 3) to be produced when the tuple with timestamp 4L is fed into the window operator. However, processing only starts once all the tuples from l have been pushed into the stream's topology. The same thing happens when I work with larger historical streams, which results in degraded performance (or even depleted memory).

Question: How can I force Flink to push windows downstream for processing as soon as a window is complete?

I believe that for SlidingEventTimeWindows the eviction of a window is triggered by watermarks. If that is the case, how can I write my topologies so that windows are triggered as soon as a tuple with a later timestamp arrives?

Thank you

AscendingTimestampExtractor uses the periodic watermarking strategy, in which Flink calls the getCurrentWatermark() method every n milliseconds, where n is the auto-watermark interval.

The default interval is 200 milliseconds, which is very long compared to the size of your windows. However, they aren't directly comparable -- the 200 msec is measured in processing time, not event time. Nevertheless, I suspect that if you haven't changed this configuration setting, then a lot of windows are created before the first watermark is emitted, which I think explains what you are seeing.
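If you go the periodic route, a minimal sketch of lowering that interval (assuming the same env as in your example; 1 ms is only an illustrative value):

env.getConfig().setAutoWatermarkInterval(1); // emit a watermark every 1 ms of processing time instead of every 200 ms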

You could reduce the auto-watermarking interval (perhaps to 1 millisecond), as sketched above. Or you could implement an AssignerWithPunctuatedWatermarks, which will give you more control; see the sketch below.
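A minimal sketch of such a punctuated assigner, assuming the same Tuple2<Long, Integer> elements as in your example and strictly ascending timestamps (the class name is my own, not part of Flink):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Emits a watermark after every element, so event-time windows can fire as
// soon as a tuple with a later timestamp shows up.
public class PerElementWatermarkAssigner
        implements AssignerWithPunctuatedWatermarks<Tuple2<Long, Integer>> {

    @Override
    public long extractTimestamp(Tuple2<Long, Integer> element, long previousElementTimestamp) {
        return element.f0; // event time is carried in the first field
    }

    @Override
    public Watermark checkAndGetNextWatermark(Tuple2<Long, Integer> lastElement, long extractedTimestamp) {
        // Timestamps are strictly ascending here, so the current timestamp is a safe watermark.
        return new Watermark(extractedTimestamp);
    }
}

You would then call stream.assignTimestampsAndWatermarks(new PerElementWatermarkAssigner()) in place of the AscendingTimestampExtractor. Emitting a watermark after every element lets windows fire as early as possible, at the cost of some per-record overhead, so on very large streams you might choose to emit a watermark only every few elements instead.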
