How do I perform a simple median algorithm for a Flink DataStream (preferably in Java and Flink 1.14)?

I have a datastream in Flink of messages that look like: (Name, Place, Number, Time)

I want to keep track of the median number per key.

To make matters a little more complicated...

Let's say I have the messages: (Jonah, Mars, 1, 1:00) (Jonah, Mars, 2, 1:01) (Jonah, Moon, 3, 1:02) (Jonah, Earth, 4, 1:03)

I want to take a median using only the most recent message per place, i.e., using just: (Jonah, Mars, 2, 1:01) (Jonah, Moon, 3, 1:02) (Jonah, Earth, 4, 1:03)

Here the answer is 3.

(Jonah, Mars, 1, 1:00) is not included because (Jonah, Mars, 2, 1:01) is more recent.

My assumption is that it will look like:

inputStream
            .keyBy(message -> message.name)
            // size and slide must be Time values; minutes are assumed here
            .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(1)))
            .<MEDIAN FUNCTION>

I am guessing the answer would leverage MapState, though I am not sure how to use windowed MapState...

Note: Here is a similar question. The advice there was not to do it. Unfortunately, though, I need a median :(

One solution would be to use a KeyedProcessFunction, where the keys are names. Then in keyed state you can keep a MapState that maps from each location to the most recent event for that location (for that name).

Then when you want to produce a result, you'll have to walk the map: iterate over its values, collect the numbers, and compute the median from them.

This is somewhat painful, but I don't have a better idea. If you are performance sensitive, need to use this at large scale, and don't need an exact answer, you could use a t-digest sketch instead.
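To make the map-walking step concrete, here is a minimal plain-Java sketch of the bookkeeping, not a definitive implementation. A HashMap stands in for the per-name MapState, and the class name, method names, and the encoding of timestamps as plain longs are all assumptions made for illustration. Inside a KeyedProcessFunction, update would correspond to a MapState.put in processElement, and median would iterate over MapState.values().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the state kept for ONE name: place -> latest reading. */
final class LatestPerPlaceMedian {

    /** The two fields of a message that matter once name and place are fixed. */
    static final class Reading {
        final long number;
        final long timestamp;

        Reading(long number, long timestamp) {
            this.number = number;
            this.timestamp = timestamp;
        }
    }

    // In Flink this would be a MapState<String, Reading> scoped to the current key.
    private final Map<String, Reading> latestByPlace = new HashMap<>();

    /** Store the reading unless a strictly newer one is already held for this place. */
    void update(String place, long number, long timestamp) {
        Reading current = latestByPlace.get(place);
        if (current == null || timestamp >= current.timestamp) {
            latestByPlace.put(place, new Reading(number, timestamp));
        }
    }

    /** Walk the map, sort the numbers, and return the exact median. */
    double median() {
        if (latestByPlace.isEmpty()) {
            throw new IllegalStateException("no readings yet");
        }
        List<Long> numbers = new ArrayList<>();
        for (Reading reading : latestByPlace.values()) {
            numbers.add(reading.number);
        }
        Collections.sort(numbers);
        int n = numbers.size();
        int mid = n / 2;
        if (n % 2 == 1) {
            return numbers.get(mid);
        }
        // Even count: average the two middle values.
        return (numbers.get(mid - 1) + numbers.get(mid)) / 2.0;
    }
}
```

Feeding it the four example messages for Jonah (with the times written as minutes, e.g. 1:01 as 61) leaves Mars=2, Moon=3, Earth=4 in the map, so median() returns 3.0. The cost is a full sort over the places on every call, which is the painful part mentioned above; it is fine when each name has few places.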
