
Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing?

We are attempting to use fixed windows in an Apache Beam pipeline (using DirectRunner). Our flow is as follows:

  1. Pull data from Pub/Sub
  2. Deserialize JSON into a Java object
  3. Window events with fixed windows of 5 seconds
  4. Using a custom CombineFn, combine each window of Events into a List<Event>
  5. For the sake of testing, simply output the resulting List<Event>

Pipeline code:

    pipeline
                // Read from pubsub topic to create unbounded PCollection
                .apply(PubsubIO
                    .<String>read()
                    .topic(options.getTopic())
                    .withCoder(StringUtf8Coder.of())
                )

                // Deserialize JSON into Event object
                .apply("ParseEvent", ParDo
                    .of(new ParseEventFn())
                )

                // Window events with a fixed window size of 5 seconds
                .apply("Window", Window
                    .<Event>into(FixedWindows
                        .of(Duration.standardSeconds(5))
                    )
                )

                // Group events by window
                .apply("CombineEvents", Combine
                    .globally(new CombineEventsFn())
                    .withoutDefaults()
                )

                // Log grouped events
                .apply("LogEvent", ParDo
                    .of(new LogEventFn())
                );

The result we are seeing is that the final step never runs: we don't get any logging.

Also, we have added System.out.println("***") to each method of our custom CombineFn class in order to track when they run, and it seems they don't run either.
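For readers who haven't written one, the four methods in question are createAccumulator, addInput, mergeAccumulators and extractOutput. Below is a minimal plain-Java sketch of that lifecycle for a list-building combiner. It is not our actual CombineEventsFn and does not depend on Beam: the class name ListCombineSketch is made up for illustration, and a real implementation would extend Combine.CombineFn<Event, List<Event>, List<Event>> instead of using static methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a list-building CombineFn (plain Java, no Beam dependency).
final class ListCombineSketch {

    // Called to start a fresh, empty accumulator.
    static <T> List<T> createAccumulator() {
        return new ArrayList<>();
    }

    // Called once per input element to fold it into an accumulator.
    static <T> List<T> addInput(List<T> accumulator, T input) {
        accumulator.add(input);
        return accumulator;
    }

    // Called to merge partial accumulators that were built independently.
    static <T> List<T> mergeAccumulators(Iterable<List<T>> accumulators) {
        List<T> merged = new ArrayList<>();
        for (List<T> accumulator : accumulators) {
            merged.addAll(accumulator);
        }
        return merged;
    }

    // Called once at the end to turn the final accumulator into the output value.
    static <T> List<T> extractOutput(List<T> accumulator) {
        return accumulator;
    }

    public static void main(String[] args) {
        List<String> a = ListCombineSketch.<String>createAccumulator();
        addInput(a, "e1");
        addInput(a, "e2");
        List<String> b = ListCombineSketch.<String>createAccumulator();
        addInput(b, "e3");
        // prints [e1, e2, e3]
        System.out.println(extractOutput(mergeAccumulators(Arrays.asList(a, b))));
    }
}
```

If none of these methods print anything, no elements are reaching the combine step's output at all, which points at the windowing/triggering step rather than at the CombineFn itself.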

Is windowing set up incorrectly here? We followed an example found at https://beam.apache.org/documentation/programming-guide/#windowing and it seems fairly straightforward, but clearly we are missing something fundamental.

Any insight is appreciated - thanks in advance!

Looks like the main issue was indeed a missing trigger - the window was opening and there was nothing telling it when to emit results. We simply wanted to window based on processing time (not event time), so we did the following:

    .apply("Window", Window
        .<Event>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(5))
            )
        )
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes()
    )

Essentially this creates a global window whose trigger fires 5 seconds (in processing time) after the first element of a pane is processed. Each time the trigger fires, the buffered elements are emitted and discarded, and a new pane starts once the next element arrives. Beam complained when we didn't have the withAllowedLateness piece - as far as I know, Duration.ZERO just tells it to ignore any late data.
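To make the firing behaviour concrete, here is a small plain-Java simulation of how I understand the batching to work. It is not Beam code and makes simplifying assumptions: arrival times are given in ascending order as processing-time milliseconds, an element arriving exactly at the deadline is counted in the next pane, and the last pane is assumed to fire eventually. The class and method names (PaneSimulation, panes) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of "fire delayMillis after the first element in the pane,
// then discard and start over". Plain Java, no Beam dependency.
final class PaneSimulation {

    // Groups ascending arrival times into the panes the trigger would emit.
    static List<List<Long>> panes(List<Long> arrivalMillis, long delayMillis) {
        List<List<Long>> fired = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long deadline = 0L;
        for (long t : arrivalMillis) {
            // The timer for the current pane has expired: emit it and discard its contents.
            if (!current.isEmpty() && t >= deadline) {
                fired.add(current);
                current = new ArrayList<>();
            }
            // The first element of a new pane starts the delay timer.
            if (current.isEmpty()) {
                deadline = t + delayMillis;
            }
            current.add(t);
        }
        // Assume the timer for the final pane fires eventually.
        if (!current.isEmpty()) {
            fired.add(current);
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(0L, 1000L, 4000L, 7000L, 8000L, 20000L);
        // prints [[0, 1000, 4000], [7000, 8000], [20000]]
        System.out.println(panes(arrivals, 5000L));
    }
}
```

Note that the pane boundaries depend on when elements happen to arrive, not on fixed 5-second clock intervals, which is the main difference from the FixedWindows setup in the question.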

My understanding may be a bit off the mark here, but the above snippet has solved our problem!
