I am trying to join two unbounded PCollection that I am getting from 2 different kafka topics on the basis of a key.
As per the docs and other blogs a join can only be possible if we do windowing. Window collects the messages from both the streams in a particular window and joins it. Which is not what I need.
The result expected is in one stream the messages are coming at a very low frequency, and from other stream we are getting messages at a high frequency. I want that if the value of the key has not arrived on both the streams we won't do a join till then and after it arrives do the join. Is it possible using the current beam paradigm ?
In short, the best solution is to use stateful DoFn in Beam. You can have a per key state (and per window, which is global window in your case).You can save one stream events in state and once events from another stream appear with the same key, join it with events in state. Here is a reference[1].
However, the short answer does not utilize true power of Beam model. The Beam model provides ways to balance among latency, cost and accuracy. It provides simple API to hide complex of streaming processing.
Why I am saying that? Let's go back to the short answer's solution: stateful DoFn. In stateful DoFn approach, you are lack of ways to address following questions:
<1, a>
from left stream on <1, b> from right stream. Later there is another <1, c>
from left stream, how do you know that you only need to emit <1, <c, b>>
, assume this is incremental mode to output result. If you start to buffer those already joined events to get delta, that really becomes too complicated for a programmer.Beam's windowing, trigger, refinement on output data, watermark and lateness SLA control are designed to hide these complex from you:
Although Beam model is well designed. The implementation of Beam model are missing critical features to support the join you described:
To conclude, Beam model is designed to handle complex in streaming processing, which perfectly fits your need. But the implementation is not good enough to let you use it now to finish your join use case.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html
This isn't something that is well supported by the Beam model today, but there are a few ways you can do it. These examples assume each key appears exactly once on each stream, if that isn't the case you'll need to adjust them.
One option is to use the Global Window and Stateful DoFn instead of a Join. The Global Window effectively turns windowing off. A stateful DoFn lets you store data about the key you are processing in a "state cell" for later use. When you receive a record, you would check the state cell for a value. If you find one, do the join, emit the value, and clear the state. If there isn't anything, store the current value.
Another option is to use Session Windows and Join. The session window "GapDuration" is effectively a timeout on a given key. This works as long as you have a time bound in which you will see the Key on both streams. You'll also want to setup an element count trigger "AfterPane.elementCountAtLeast(2)" so you don't have to wait for the full timeout after seeing the second piece of data.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.