Apache Flink enrichment

I have a source of events that looks like this:

class Event {
    String userName;
    String webPage;
}

I need to enrich my stream of events with each user's past web page accesses. (I have this information in a DB and can use it as a Flink source.)

class EventStats {
    String userName;
    Map<String,Integer> webPageCounters; 
}

How do I make sure the enrichment data is ready before I start processing the Event stream?
I do not want to make DB calls from inside my stream.

This may be a struggle to do with Flink, to be honest. The first idea that comes to mind is to scan the DB when the job starts and create a separate stream from the result. That stream could be used for initialization, and you could simply union it with the actual EventStats stream, but this is not currently possible due to this issue. So there are basically two solutions that can be used.

The first one is quite simple: if you are doing the join manually, you can keep in state the elements from the Event stream that do not yet have a matching EventStats. When an EventStats arrives, you check whether any buffered Event matches it and can now be emitted. You should probably also add logic that removes elements from state after some time if they are never matched.
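The buffering idea can be sketched in plain Java, without any Flink dependency, just to show the logic. The class name BufferingEnricher, the Enriched output type, and the method names onEvent and onStats are all hypothetical. In a real job this logic would more likely live in something like a KeyedCoProcessFunction keyed by userName, with the two maps replaced by keyed state and a timer handling the expiry of unmatched events (expiry is not shown here).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the manual join; names are illustrative, not Flink API.
class BufferingEnricher {

    // Hypothetical output record: an event plus the user's past visit count for that page.
    static final class Enriched {
        final String userName;
        final String webPage;
        final int pastVisits;

        Enriched(String userName, String webPage, int pastVisits) {
            this.userName = userName;
            this.webPage = webPage;
            this.pastVisits = pastVisits;
        }
    }

    // Stand-ins for keyed state: known stats and events still waiting for stats.
    private final Map<String, Map<String, Integer>> statsByUser = new HashMap<>();
    private final Map<String, List<String>> pendingPagesByUser = new HashMap<>();

    // Called per Event: emit at once if stats are known, otherwise buffer the event.
    List<Enriched> onEvent(String userName, String webPage) {
        List<Enriched> out = new ArrayList<>();
        Map<String, Integer> counters = statsByUser.get(userName);
        if (counters == null) {
            pendingPagesByUser.computeIfAbsent(userName, k -> new ArrayList<>()).add(webPage);
        } else {
            out.add(enrich(userName, webPage, counters));
        }
        return out;
    }

    // Called per EventStats: store the stats and flush any buffered events for that user.
    List<Enriched> onStats(String userName, Map<String, Integer> webPageCounters) {
        statsByUser.put(userName, webPageCounters);
        List<Enriched> out = new ArrayList<>();
        List<String> pending = pendingPagesByUser.remove(userName);
        if (pending != null) {
            for (String page : pending) {
                out.add(enrich(userName, page, webPageCounters));
            }
        }
        return out;
    }

    private static Enriched enrich(String userName, String webPage, Map<String, Integer> counters) {
        return new Enriched(userName, webPage, counters.getOrDefault(webPage, 0));
    }
}
```

An event for a user with no stats yet produces no output from onEvent; it is emitted later by the onStats call for that user. Pages missing from the counters map are treated as zero past visits, which is an assumption of this sketch.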

The other solution is a little trickier, but also more elegant. You can implement a custom operator that implements InputSelectable, so that it first consumes everything from the EventStats input and only then starts reading elements from the Event stream. There are some caveats with this approach; refer to the documentation for more info. Also note that InputSelectable was introduced in Flink 1.9.
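The selection protocol behind this approach can be illustrated with a toy model in plain Java. Nothing below is Flink API: StatsFirstSelector, its Selection enum, and the drain loop are hypothetical stand-ins. In Flink, the operator would implement InputSelectable and return an InputSelection from nextSelection(), and, as far as I know, would learn that the bounded stats input has finished through BoundedMultiInput#endInput. The sketch also assumes the stats input is bounded, which is what lets it be drained completely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of "read all stats first, then events"; not Flink code.
class StatsFirstSelector {

    // Hypothetical stand-in for choosing between the operator's two inputs.
    enum Selection { STATS, EVENTS }

    private boolean statsFinished = false;

    // The runtime asks which input to read next; stay on stats until they are finished.
    Selection nextSelection() {
        return statsFinished ? Selection.EVENTS : Selection.STATS;
    }

    // Notification that the bounded stats input has no more elements.
    void endOfStats() {
        statsFinished = true;
    }

    // Toy "runtime" loop that obeys the selector and records the processing order.
    static List<String> drain(Queue<String> stats, Queue<String> events) {
        StatsFirstSelector selector = new StatsFirstSelector();
        List<String> order = new ArrayList<>();
        while (!stats.isEmpty() || !events.isEmpty()) {
            if (selector.nextSelection() == Selection.STATS) {
                if (stats.isEmpty()) {
                    selector.endOfStats(); // stats exhausted, switch to events
                } else {
                    order.add("stats:" + stats.poll());
                }
            } else {
                order.add("event:" + events.poll());
            }
        }
        return order;
    }
}
```

Whatever order the two queues were filled in, every stats element is processed before the first event, which is the guarantee the question asks for.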
