
How do I perform a simple median algorithm for a Flink DataStream (preferably in Java and Flink 1.14)?

I have a datastream in Flink of messages that look like: (Name, Place, Number, Time)

I want to keep track of the median number per key.

To make matters a little more complicated....

Let's say I have the messages:

(Jonah, Mars, 1, 1:00)
(Jonah, Mars, 2, 1:01)
(Jonah, Moon, 3, 1:02)
(Jonah, Earth, 4, 1:03)

I want to take the median using only the most recent message per place, i.e., using just:

(Jonah, Mars, 2, 1:01)
(Jonah, Moon, 3, 1:02)
(Jonah, Earth, 4, 1:03)

Here the answer is 3.

(Jonah, Mars, 1, 1:00) was not included because (Jonah, Mars, 2, 1:01) is more recent.

My assumption is that it will look like:

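inputStream
            .keyBy(message -> message.name)
            // of() takes Time arguments (org.apache.flink.streaming.api.windowing.time.Time);
            // minutes are an assumed unit here
            .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(1)))
            .<MEDIAN FUNCTION>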

I am guessing the answer would leverage MapState, though I am not sure how to use windowed MapState...

Note: Here is a similar question. The advice there was not to do it... Unfortunately, though, I need a median :(

One solution would be to use a KeyedProcessFunction, where the keys are names. Then in keyed state you can keep a MapState that maps from locations to the most recent event for that location (for that name).

Then when you want to produce a result, you'll have to walk the map.

This is somewhat painful, but I don't have a better idea. If you are performance sensitive, need to use this at large scale, and don't need an exact answer, you could use a t-digest sketch instead.
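
The core of this approach can be sketched in plain Java. This is only an illustration under stated assumptions, not a Flink job: the `Message` class and the `LatestPerPlaceMedian` class are hypothetical names, the `HashMap` stands in for a `MapState<String, Message>` held in keyed state for one name, times are represented as minutes since midnight, and for an even number of places the median is taken as the mean of the two middle values (the question does not say which convention it wants).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestPerPlaceMedian {

    /** Hypothetical message type mirroring (Name, Place, Number, Time). */
    static final class Message {
        final String name;
        final String place;
        final double number;
        final long time; // assumed representation: minutes since midnight

        Message(String name, String place, double number, long time) {
            this.name = name;
            this.place = place;
            this.number = number;
            this.time = time;
        }
    }

    // Stand-in for a MapState<String, Message> scoped to one key (one name).
    private final Map<String, Message> latestByPlace = new HashMap<>();

    /** Stores the message unless a strictly newer one is already held for its place. */
    void update(Message m) {
        Message current = latestByPlace.get(m.place);
        if (current == null || m.time >= current.time) {
            latestByPlace.put(m.place, m);
        }
    }

    /** Walks the map and returns the median of the stored numbers (NaN if empty). */
    double median() {
        List<Double> numbers = new ArrayList<>();
        for (Message m : latestByPlace.values()) {
            numbers.add(m.number);
        }
        if (numbers.isEmpty()) {
            return Double.NaN;
        }
        Collections.sort(numbers);
        int n = numbers.size();
        int mid = n / 2;
        // Assumption: average the two middle values when the count is even.
        return (n % 2 == 1)
                ? numbers.get(mid)
                : (numbers.get(mid - 1) + numbers.get(mid)) / 2.0;
    }

    public static void main(String[] args) {
        LatestPerPlaceMedian state = new LatestPerPlaceMedian();
        state.update(new Message("Jonah", "Mars", 1, 60));  // 1:00
        state.update(new Message("Jonah", "Mars", 2, 61));  // 1:01, replaces the 1:00 entry
        state.update(new Message("Jonah", "Moon", 3, 62));  // 1:02
        state.update(new Message("Jonah", "Earth", 4, 63)); // 1:03
        System.out.println(state.median()); // prints 3.0
    }
}
```

In a KeyedProcessFunction, `update` would correspond to what happens in `processElement`, and `median` to whatever point you choose to emit a result (for example, on every element or from a timer). Sorting on every call costs O(p log p) for p places per name, which is the "painful" part mentioned above.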
