
Apache Beam Session Windowing and joining across PCollections

We have two streams, S1 and S2, of events that share the same keys (userId). Is it possible to apply a session window across both collections, so that an occurrence of key X in either stream contributes to the same session? Would this create windows that span both PCollections and let us join them afterwards?

For Context:

  • We are using the DataflowRunner
  • Both S1 and S2 are unbounded collections read with PubsubIO

Many Thanks!

Yes, this is possible. Windowing only takes effect when you perform a grouping operation, so you can window each stream into sessions separately and then group them together. This means that you can do something simple like this:

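import apache_beam as beam
from apache_beam.transforms import window

p = beam.Pipeline(...)

# ReadMyFirstStream / ReadMySecondStream stand in for your own sources; they
# should emit (userId, value) pairs. Assume that timestamp information is
# already in the streams. Use the same session gap for both streams.
first_stream = p | ReadMyFirstStream() | beam.WindowInto(window.Sessions(...))
second_stream = p | ReadMySecondStream() | beam.WindowInto(window.Sessions(...))

# CoGroupByKey emits (userId, {'first': [...], 'second': [...]}) per session.
joined_streams = (
    {'first': first_stream,
     'second': second_stream}
    | beam.CoGroupByKey())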

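The joined_streams PCollection contains one element per key and session, holding the elements from both streams that fell into that session. CoGroupByKey flattens its inputs before grouping, so session windows are merged per key across both streams, and an event for key X in either stream extends the same session.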
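To make the merging behaviour concrete, here is a small plain-Python model of what the join produces. It is not Beam code and not how Beam is implemented; it is only a sketch, assuming each event opens a window [timestamp, timestamp + gap) and that overlapping windows for the same key merge. The function name session_cogroup and the sample events are made up for illustration.

```python
from collections import defaultdict


def _emit(key, session):
    # Package one finished session as (key, (start, end), grouped values).
    return (key, (session["start"], session["end"]), session["groups"])


def session_cogroup(first, second, gap):
    """Simplified model of session windowing followed by a co-group.

    first, second: iterables of (key, timestamp, value) tuples.
    gap: session gap, in the same units as the timestamps.
    Returns a list of (key, (start, end), {'first': [...], 'second': [...]}).
    """
    # Tag every event with its source stream and pool the events per key,
    # mirroring the "flatten, then group" shape of a co-group.
    per_key = defaultdict(list)
    for tag, stream in (("first", first), ("second", second)):
        for key, ts, value in stream:
            per_key[key].append((ts, tag, value))

    results = []
    for key in sorted(per_key):
        current = None
        for ts, tag, value in sorted(per_key[key], key=lambda e: e[0]):
            if current is not None and ts < current["end"]:
                # The event's window overlaps the open session: extend it.
                current["end"] = max(current["end"], ts + gap)
            else:
                # No overlap: close the open session and start a new one.
                if current is not None:
                    results.append(_emit(key, current))
                current = {"start": ts, "end": ts + gap,
                           "groups": {"first": [], "second": []}}
            current["groups"][tag].append(value)
        results.append(_emit(key, current))
    return results


# Hypothetical sample events: (userId, timestamp, payload).
s1 = [("u1", 0, "click"), ("u1", 100, "click2")]
s2 = [("u1", 5, "purchase"), ("u2", 3, "view")]

for row in session_cogroup(s1, s2, gap=10):
    print(row)
```

With a gap of 10, the u1 events at timestamps 0 (first stream) and 5 (second stream) end up in a single session [0, 15) that holds values from both streams, while the u1 event at 100 forms a session of its own with an empty 'second' list.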


This will work in Java as well. I answered using Python for the sake of simplicity. Let me know if you'd prefer Java code.


 