
Correct Apache Beam windowing strategy for joining two streams of data

I have a use case where I need to read data from two streams and join them by key. One stream contains send records and the other contains the acknowledgements for those sends, so a given key, say 123, appears exactly once in stream1 and exactly once in stream2. The send and the ack may arrive at any time on either stream, but I can assume that the send appears before the ack. The ack may be lost, however; sends are never lost by the system. To simplify, I want to limit the time I wait for an ack to 3 minutes.

What would be the best windowing strategy to use in this case?

  1. If I use fixed windows, the send and the ack may fall into two different windows and I won't be able to join them. I could possibly use the allowed-lateness API, but should I accumulate panes in that case?

  2. I tried using a session window with a gap duration of 3 minutes, but I did not see the trigger fire at the end of 3 minutes. The join happened after 6 minutes. This is my code:

    Join code:

     private static <T> Window<T> window() {
         return Window.<T>into(Sessions.withGapDuration(Duration.standardMinutes(3)))
             .triggering(AfterWatermark.pastEndOfWindow())
             .withAllowedLateness(Duration.ZERO)
             .discardingFiredPanes();
     }

     PCollection<KV<String, Record>> sendStream = KafkaIO.readRecord().withTopic(pipelineOptions.getInputTopic()).window();
     PCollection<KV<String, Record>> ackStream = KafkaIO.readRecord().withTopic(pipelineOptions.getInputTopic()).window();

     PCollection<KV<String,String>> joins = sendStream.apply("Joining Streams", Joins.innerJoin(ackStream)).apply(...);

    Output:

    19-01-2022 11:41:19 Send: 1mtz7n-kxi88e7a-89
    19-01-2022 11:42:19 Ack: 1mtz7n-kxi88e7a-89
    19-01-2022 11:48:33 JoinedByKey: 1mtz7n-kxi88e7a-89

I see a delay ranging from 6 to 7 minutes. I was under the impression that the session would last only 3 minutes.

I also want to trigger as soon as an event has been received on both streams. If I use earlyFiringTriggers, the joined data is output twice: once when both elements are joined and again at the end of the window. I want to avoid this too, but I have not been able to configure my window method correctly.

Any advice?

This join is best implemented in the global window, using state to buffer the two sides of the join and timers to clean up the state. The pattern can be generalized, so it is tracked as a feature request at https://issues.apache.org/jira/browse/BEAM-7386. An implementation is under review/development at https://github.com/apache/beam/pull/15275. You may be able to use the details in the issue and that pull request to complete an implementation that is simplified and customized for your case.
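
To make the buffer-and-expire idea concrete, here is a minimal plain-Java model of the logic. It is not Beam API code and not the implementation from the pull request. The class `SendAckJoinModel`, its method names, and the joined output format are all hypothetical. In a Beam pipeline the map would roughly correspond to per-key state and `advanceTo` to a timer callback. The sketch relies on the assumption from the question that the send arrives before the ack, so only sends are buffered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a keyed send/ack join with a timeout (hypothetical, not Beam API). */
final class SendAckJoinModel {
    /** Timeout matching the 3 minutes mentioned in the question. */
    static final long TIMEOUT_MILLIS = 3 * 60 * 1000L;

    /** A buffered send and the time at which it stops waiting for its ack. */
    private static final class PendingSend {
        final String payload;
        final long deadlineMillis;

        PendingSend(String payload, long deadlineMillis) {
            this.payload = payload;
            this.deadlineMillis = deadlineMillis;
        }
    }

    /** Stands in for per-key state: at most one buffered send per key. */
    private final Map<String, PendingSend> pending = new HashMap<>();

    /** Buffers the send and records its expiry deadline. */
    void onSend(String key, String payload, long eventTimeMillis) {
        pending.put(key, new PendingSend(payload, eventTimeMillis + TIMEOUT_MILLIS));
    }

    /**
     * Joins the ack with its buffered send and clears the buffer entry.
     * Returns null if no send is buffered or the deadline has been reached.
     */
    String onAck(String key, String ackPayload, long eventTimeMillis) {
        PendingSend send = pending.remove(key);
        if (send == null || eventTimeMillis >= send.deadlineMillis) {
            return null;
        }
        return key + ": " + send.payload + " + " + ackPayload;
    }

    /** Stands in for the cleanup timer: drops expired sends and returns their keys. */
    List<String> advanceTo(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, PendingSend>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, PendingSend> entry = it.next();
            if (entry.getValue().deadlineMillis <= nowMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    /** Number of sends still waiting for an ack. */
    int pendingCount() {
        return pending.size();
    }
}
```

In this model the joined record is produced exactly once, at the moment the ack arrives, and a send whose ack never shows up is dropped when its deadline passes. That is the behaviour the question asks for: output as soon as both sides are present, with no second firing at the end of a window.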
