
Correct Apache Beam windowing strategy for joining two streams of data

I have a use case where I need to read data from two streams and join them by key. One stream contains a send record and the other contains an acknowledgement of that send. So I will find exactly one element with a given key, say 123, in stream1 and exactly one element with key 123 in stream2. The send and the ack may arrive at any time in either stream, but I can assume that the send appears before the ack. The ack may be lost, however; sends are never lost by the system. I want to limit the time I wait for an ack to, say, 3 minutes (to simplify).

What would be the best windowing strategy to use in this case?

  1. If I use fixed windows, the send and the ack may fall into two different windows and I won't be able to join them. I could possibly use the lateness API, but should I accumulate panes in that case?

  2. I tried using a session window with a gap duration of 3 minutes, but I did not see a trigger fire at the end of 3 minutes. The join happened after 6 minutes. This is my code:

    Join code:

     private static <T> Window<T> window() {
       return Window.<T>into(Sessions.withGapDuration(Duration.standardMinutes(3)))
           .triggering(AfterWatermark.pastEndOfWindow())
           .withAllowedLateness(Duration.ZERO)
           .discardingFiredPanes();
     }

     PCollection<KV<String, Record>> sendStream =
         KafkaIO.readRecord().withTopic(pipelineOptions.getInputTopic()).window();
     PCollection<KV<String, Record>> ackStream =
         KafkaIO.readRecord().withTopic(pipelineOptions.getInputTopic()).window();
     PCollection<KV<String, String>> joins =
         sendStream.apply("Joining Streams", Joins.innerJoin(ackStream)).apply(...);

    Output:

    19-01-2022 11:41:19 Send: 1mtz7n-kxi88e7a-89
    19-01-2022 11:42:19 Ack: 1mtz7n-kxi88e7a-89
    19-01-2022 11:48:33 JoinedByKey: 1mtz7n-kxi88e7a-89

I see a delay ranging from 6 to 7 minutes. I was under the impression that the session should only last 3 minutes.

I also want to trigger as soon as an event has been received in both streams. If I use earlyFiringTriggers, the joined data is output once when both elements are joined and again at the end of the window. I want to avoid this as well, but I am not able to configure my window method correctly.

Any advice?

This join is best implemented in the global window, using state to buffer the two sides of the join and timers to clean up the state. The pattern can be generalized, so it is tracked as a feature request in https://issues.apache.org/jira/browse/BEAM-7386. There is an implementation under review/development at https://github.com/apache/beam/pull/15275. You may be able to use the details in the issue and that pull request to complete an implementation that is simplified and customized for your case.
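To make the buffer-and-expire idea concrete, here is a minimal plain-Java model of the per-key logic. It is not Beam code and not the implementation from the pull request: the class name `SendAckJoinSketch`, its methods, and the `advanceWatermark` call are all hypothetical, and the maps merely stand in for the per-key state and event-time timers that a stateful `DoFn` would use in a real pipeline. It also assumes, as the question does, that the send always arrives before the ack, so only the send side is buffered.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a buffer-and-expire join. This is NOT Beam code:
// the maps stand in for per-key state and event-time timers.
class SendAckJoinSketch {
    // Assumed timeout: the 3 minutes mentioned in the question.
    static final long ACK_TIMEOUT_MILLIS = 3 * 60 * 1000L;

    private final Map<String, String> pendingSends = new LinkedHashMap<>();
    private final Map<String, Long> expiryTimers = new LinkedHashMap<>();
    private final List<String> joined = new ArrayList<>();
    private final List<String> expired = new ArrayList<>();

    // Buffer the send and arm a timer that fires ACK_TIMEOUT_MILLIS later.
    void onSend(String key, String payload, long eventTimeMillis) {
        pendingSends.put(key, payload);
        expiryTimers.put(key, eventTimeMillis + ACK_TIMEOUT_MILLIS);
    }

    // Emit the join once, as soon as the ack arrives, then clear the state
    // and the timer so nothing is emitted a second time for this key.
    void onAck(String key, String payload) {
        String send = pendingSends.remove(key);
        if (send == null) {
            return; // No buffered send: it expired, or the ack is a duplicate.
        }
        expiryTimers.remove(key);
        joined.add(key + ": " + send + " + " + payload);
    }

    // Fire every timer at or before the watermark; each drops its unmatched send.
    void advanceWatermark(long watermarkMillis) {
        Iterator<Map.Entry<String, Long>> it = expiryTimers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> timer = it.next();
            if (timer.getValue() <= watermarkMillis) {
                pendingSends.remove(timer.getKey());
                expired.add(timer.getKey());
                it.remove();
            }
        }
    }

    List<String> joined() { return joined; }
    List<String> expired() { return expired; }

    public static void main(String[] args) {
        SendAckJoinSketch join = new SendAckJoinSketch();
        join.onSend("123", "send", 0L);
        join.onAck("123", "ack");                  // joined immediately
        join.onSend("456", "send", 0L);            // its ack is lost
        join.advanceWatermark(ACK_TIMEOUT_MILLIS); // timer fires, state is cleared
        System.out.println("joined=" + join.joined() + " expired=" + join.expired());
    }
}
```

Because the joined record is emitted when the ack arrives and the state is cleared at the same time, each key produces at most one output, and there is no second firing at the end of a window. If acks could arrive before sends, the ack side would need a symmetric buffer.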
