Joining more than 2 streams using the same sliding window in Flink

I have three streams, A, B and C, that I need to join into a single stream (let's call it ABC) and then run some operation on.

It is important that I use sliding windows with size X and slide Y, where Y <= X*3.

All the streams contain a common ID that I use for the join. X and Y are time parameters defined in seconds.

My current implementation joins streams A and B into AB using a tumbling window of size X, and then joins AB with C using a sliding window of size X and slide Y.

This can give incorrect results in cases such as the following: stream A receives a message at time 0, and stream B receives a message at time Y+1. Both messages should land in the same sliding window because Y+1 < X, but when I join AB with C the message from B is missing because of the initial tumbling window.
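To make the window arithmetic concrete, here is a minimal plain-Java sketch (no Flink dependency) of how timestamps map to sliding and tumbling windows. It assumes windows aligned at offset 0 and half-open intervals [start, start + size), which is my understanding of how Flink's built-in assigners behave; the class name, method names, and the values X = 10 and Y = 5 are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowDemo {

    /** Start times of every sliding window [start, start + size) that contains ts. */
    static List<Long> slidingWindowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is <= ts (assumes windows aligned at offset 0).
        long lastStart = ts - Math.floorMod(ts, slide);
        // Walk backwards while the window still covers ts.
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Start time of the single tumbling window [start, start + size) that contains ts. */
    static long tumblingWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Start times of the sliding windows that contain both timestamps. */
    static List<Long> sharedWindowStarts(long tsA, long tsB, long size, long slide) {
        List<Long> shared = new ArrayList<>(slidingWindowStarts(tsA, size, slide));
        shared.retainAll(slidingWindowStarts(tsB, size, slide));
        return shared;
    }

    public static void main(String[] args) {
        long x = 10; // hypothetical window size X, in seconds
        long y = 5;  // hypothetical slide Y, in seconds

        // The case from the question: A at time 0, B at time Y + 1.
        System.out.println(sharedWindowStarts(0, y + 1, x, y)); // [0]

        // A pair that straddles a tumbling boundary: A at 8, B at 12.
        System.out.println(tumblingWindowStart(8, x) == tumblingWindowStart(12, x)); // false
        System.out.println(sharedWindowStarts(8, 12, x, y)); // [5]
    }
}
```

Under these assumptions, whether a pair is lost in the first stage depends on where the tumbling boundaries fall: the pair at 8 and 12 shares the sliding window starting at 5 but lands in two different tumbling windows, so a tumbling join of A and B would never emit it.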

Can I do a multi-stream join in Flink using a single sliding window, similar to how I would join multiple DataFrames in Spark?

I think what will work here is two sliding window joins: one to compute AB, and another to join those results with C. The one issue you may run into is the timestamps on the records produced by the first join. I'm not sure what timestamps Flink puts into the StreamRecords that wrap your AB events, but for normal (non-join) windows Flink sets the timestamp of each result record to the window end time, which may not be what you want here. If it is an issue, you can add a timestamp assigner after the first sliding window join, before the second join (with C), to set the timestamps appropriately.
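The effect of the result timestamp can be sketched in plain Java (no Flink dependency). The sketch assumes that a joined AB record is stamped with the last millisecond of the window that produced it, as for ordinary windows; the answer above is not sure this holds for joins, so treat it as an assumption. Re-stamping with the later of the two input timestamps is only one hypothetical choice of "appropriate" timestamp, and all names and values here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class JoinTimestampDemo {

    /** Start times of every sliding window [start, start + size) that contains ts. */
    static List<Long> slidingWindowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Assumed default: a window result carries the window's last millisecond. */
    static long defaultResultTimestamp(long windowStart, long size) {
        return windowStart + size - 1;
    }

    /** True if at least one sliding window contains both timestamps. */
    static boolean shareWindow(long ts1, long ts2, long size, long slide) {
        List<Long> shared = new ArrayList<>(slidingWindowStarts(ts1, size, slide));
        shared.retainAll(slidingWindowStarts(ts2, size, slide));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        long size = 10_000;  // hypothetical X = 10 s, in milliseconds
        long slide = 5_000;  // hypothetical Y = 5 s, in milliseconds
        long tsA = 0, tsB = 1_000, tsC = 12_000;

        // AB emitted by the first-stage window [0, 10000).
        long defaultTs = defaultResultTimestamp(0, size); // 9999
        // Hypothetical re-stamping: use the later of the two input timestamps.
        long restampedTs = Math.max(tsA, tsB);            // 1000

        // With the assumed default stamp, AB still meets C in window [5000, 15000),
        // although C arrived 12 s after A.
        System.out.println(shareWindow(defaultTs, tsC, size, slide));   // true
        // After re-stamping, AB and C no longer share any window.
        System.out.println(shareWindow(restampedTs, tsC, size, slide)); // false
    }
}
```

In a real job, the re-stamping step would sit between the two joins, for example via `assignTimestampsAndWatermarks` with a `WatermarkStrategy` whose timestamp assigner reads a field carried on the AB record. Which timestamp is correct depends on the semantics you want for the three-way join.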

