
Splitting & joining streams in Apache Flink

I think I have a rather non-standard use case. I want to split my source stream into several streams using the filter function:

val dataStream:DataStream[MyEvent] = ...
val s1 = dataStream.filter(...).map(...)
val s2 = dataStream.filter(...).map(...)

I also have a timestamp extractor (the incoming events will have a timestamp attached to them in XML):

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
...

dataStream.assignTimestampsAndWatermarks(new MyTimestampExtractor)
...

class MyTimestampExtractor extends AssignerWithPunctuatedWatermarks[Elem]
{
  override def checkAndGetNextWatermark(lastElement:Elem, extractedTimestamp:Long):Watermark = new Watermark(extractedTimestamp)
  override def extractTimestamp(element:Elem, previousElementTimestamp:Long):Long = XmlOperations.getDateTime(element, "@timestamp").getMillis
}

I chose this approach instead of a single chained stream ( val s = dataStream.filter(...).map(...).filter(...).map(...) ) because I want to build a network which splits/combines arbitrary streams (e.g. s1+s2->c1, s1+s3->c2, c2+s4->c3, ...)

Now when sending events through the above example, it might be that event E1 ends up in both s1 and s2. This means, in my understanding, that the very same event E1 is put as a first instance into s1 (E1a) and as a second instance into s2 (E1b).

So what I want to do now is to re-unite E1a and E1b into a combined E1 which has the transformations of both s1 and s2 applied.

I tried:

val c1 = s1.join(s2)
  .where(_.key).equalTo(_.key)
  .window(TumblingEventTimeWindows.of(Time.seconds(10)))
  .apply((e1a, e1b) => { printf("Got e1a and e1b"); e1a })

However, it seems like the events never reach the apply function, and I can't find out why.

What is wrong in my example? Will my approach / idea of a network of streams like this work at all?

Have you arranged for watermarks to be generated? When working with event time, a window is only triggered when a watermark arrives that advances the event-time clock past the end of the window. Watermarks are produced by a timestamp extractor / watermark generator; see the example in the documentation for more details.

If one of the streams is sometimes idle, that can also cause problems, as the lack of watermarks on the idle stream will hold back watermarks for any streams it is connected to.
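One thing worth checking here, offered as a sketch rather than a confirmed diagnosis: in the DataStream API, assignTimestampsAndWatermarks returns a new stream, so filters applied to the original dataStream would not see the timestamps or watermarks. The example below assumes the Flink 1.x Scala API used in the question; MyEvent, its fields, MyEventTimestamps and the filter/map bodies are hypothetical stand-ins for the question's elided code.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type standing in for the question's MyEvent.
case class MyEvent(key: String, timestamp: Long, payload: String)

// Emits a watermark after every element, like the extractor in the question.
class MyEventTimestamps extends AssignerWithPunctuatedWatermarks[MyEvent] {
  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long =
    element.timestamp

  override def checkAndGetNextWatermark(lastElement: MyEvent, extractedTimestamp: Long): Watermark =
    new Watermark(extractedTimestamp)
}

object EventTimeJoinSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val raw: DataStream[MyEvent] = env.fromElements(
      MyEvent("a", 1000L, "x"),
      MyEvent("b", 4000L, "y"),
      MyEvent("a", 12000L, "z"))

    // Keep the returned stream: only `timed` carries timestamps and watermarks.
    val timed: DataStream[MyEvent] = raw.assignTimestampsAndWatermarks(new MyEventTimestamps)

    // Both branches are derived from `timed`, not from `raw`.
    val s1 = timed.filter(_.payload.nonEmpty).map(e => e.copy(payload = e.payload + "-s1"))
    val s2 = timed.filter(_.timestamp >= 0L).map(e => e.copy(payload = e.payload + "-s2"))

    s1.join(s2)
      .where(_.key).equalTo(_.key)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .apply((a: MyEvent, b: MyEvent) => (a.key, a.payload, b.payload))
      .print()

    env.execute("event-time join sketch")
  }
}
```

With a punctuated watermark equal to each element's timestamp, a 10-second window can only fire once a later element pushes the watermark to the window's end, so the last open window stays pending until such an element (or the end of a bounded input) arrives.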
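To make the idle-stream effect concrete, here is a small plain-Scala simulation (not Flink code) of the rule that a two-input operator's event-time clock is the minimum of the watermarks seen on its inputs. TwoInputClock and its methods are hypothetical names invented for this illustration; the firing condition mirrors Flink's event-time trigger, which fires once the watermark reaches the window's last timestamp (end minus one millisecond).

```scala
object WatermarkMinSketch {

  // Long.MinValue stands for "no watermark seen yet" on an input.
  final case class TwoInputClock(left: Long = Long.MinValue, right: Long = Long.MinValue) {
    // Watermarks never move backwards on an input.
    def onLeft(wm: Long): TwoInputClock = copy(left = math.max(left, wm))
    def onRight(wm: Long): TwoInputClock = copy(right = math.max(right, wm))

    // The operator's clock is held back by the slower input.
    def current: Long = math.min(left, right)

    // A window [start, end) can fire once the clock reaches end - 1.
    def canFire(windowEnd: Long): Boolean = current >= windowEnd - 1
  }

  def main(args: Array[String]): Unit = {
    val windowEnd = 10000L

    // Only the left input produces watermarks; the right input is idle.
    val onlyLeft = TwoInputClock().onLeft(5000L).onLeft(15000L)
    println(onlyLeft.canFire(windowEnd)) // false: the idle input holds the clock back

    // Once the right input also advances, the window can fire.
    val both = onlyLeft.onRight(12000L)
    println(both.canFire(windowEnd)) // true
  }
}
```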

Depending on exactly what you are trying to do, you may find it easier to use a CoProcessFunction than a time-windowed join. Take a look at the exercises on stateful enrichment and expiring state from the Flink training site for examples.
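As a rough illustration of the CoProcessFunction route, the sketch below buffers whichever of E1a / E1b arrives first in keyed state and emits the pair when its counterpart shows up. It assumes the Flink 1.x Scala API; MyEvent, Reunite, ReuniteWiring and the ttlMs parameter are hypothetical names, and the expiry logic is deliberately simplistic (a timer left over from an earlier event with the same key may clear newer state).

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical event type standing in for the question's MyEvent.
case class MyEvent(key: String, timestamp: Long, payload: String)

class Reunite(ttlMs: Long) extends CoProcessFunction[MyEvent, MyEvent, (MyEvent, MyEvent)] {
  private type Out = (MyEvent, MyEvent)

  // Keyed state: at most one buffered event per key and per side.
  private lazy val fromS1: ValueState[MyEvent] =
    getRuntimeContext.getState(new ValueStateDescriptor[MyEvent]("from-s1", classOf[MyEvent]))
  private lazy val fromS2: ValueState[MyEvent] =
    getRuntimeContext.getState(new ValueStateDescriptor[MyEvent]("from-s2", classOf[MyEvent]))

  override def processElement1(e: MyEvent,
                               ctx: CoProcessFunction[MyEvent, MyEvent, Out]#Context,
                               out: Collector[Out]): Unit = {
    val other = fromS2.value()
    if (other != null) {
      // Counterpart already arrived: emit the pair and free the state.
      out.collect((e, other))
      fromS2.clear()
    } else {
      // Buffer this side and schedule a cleanup in event time.
      fromS1.update(e)
      ctx.timerService().registerEventTimeTimer(e.timestamp + ttlMs)
    }
  }

  override def processElement2(e: MyEvent,
                               ctx: CoProcessFunction[MyEvent, MyEvent, Out]#Context,
                               out: Collector[Out]): Unit = {
    val other = fromS1.value()
    if (other != null) {
      out.collect((other, e))
      fromS1.clear()
    } else {
      fromS2.update(e)
      ctx.timerService().registerEventTimeTimer(e.timestamp + ttlMs)
    }
  }

  // Expire unmatched events so state does not grow without bound.
  override def onTimer(timestamp: Long,
                       ctx: CoProcessFunction[MyEvent, MyEvent, Out]#OnTimerContext,
                       out: Collector[Out]): Unit = {
    fromS1.clear()
    fromS2.clear()
  }
}

object ReuniteWiring {
  // Both inputs must be keyed the same way for the keyed state to line up.
  def wire(s1: DataStream[MyEvent], s2: DataStream[MyEvent]): DataStream[(MyEvent, MyEvent)] =
    s1.keyBy(_.key).connect(s2.keyBy(_.key)).process(new Reunite(10000L))
}
```

Unlike the windowed join, this emits a pair as soon as both halves are present and does not depend on a window boundary being crossed; the event-time timers still need watermarks to fire, but only for cleanup.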
