How to split a window based on a second key in Apache Flink?

I am trying to build a data stream processing job for a product scanner that generates events as the following Tuple4: Timestamp (long, in milliseconds), ClientID (int), ProductID (int), Quantity (int).

At the end, a stream of Tuple3 should be obtained: ClientID (int), ProductID (int), Quantity (int), which represents a grouping of all the products with the same ProductID purchased by one client with a given ClientID. Within any "transaction" there can be a gap of at most 10 seconds between product scans.

This is a short snippet of code that shows my initial attempt:

        DataStream<Tuple4<Long, Integer, Integer, Integer>> inStream = ...;

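        // The key selector returns a (ClientID, ProductID) pair, so the key type is Tuple2, not Integer
        WindowedStream<Tuple4<Long, Integer, Integer, Integer>, Tuple2<Integer, Integer>, TimeWindow> windowedStream = inStream
            .keyBy((tuple) -> Tuple2.of(tuple.f1, tuple.f2))
            .window(EventTimeSessionWindows.withGap(Time.seconds(10)));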
        
        windowedStream.aggregate(...); // Drop timestamp, sum quantity, keep the rest the same

However, this is where the issue comes in. Normally, a SessionWindow would be enough, but in this case it enforces the 10-second gap between two events with the same key (ClientID, ProductID), which is not what is expected.

If we imagine the following tuples coming in:

  1. (10_000, 1, 1, 1) <6 second gap>
  2. (16_000, 1, 2, 1) <6 second gap>
  3. (22_000, 1, 1, 1) <6 second gap>
  4. (28_000, 1, 2, 1)

The whole sequence of tuples should be in the same SessionWindow, with 1 merged with 3 and 2 merged with 4, generating two output events. However, they are not in the same SessionWindow: the keyBy splits 1+3 and 2+4 into separate streams, and within each of those streams the two events are 12 seconds apart, so they do not meet the requirement of at most 10 seconds between products and are not aggregated.
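The difference between the two keyings can be checked by hand. Below is a minimal plain-Java sketch, with no Flink dependency, that counts sessions in a sorted list of timestamps. The names `SessionGapSketch` and `countSessions` are hypothetical, and treating a distance of exactly 10 seconds as still belonging to the same session is an assumption of this sketch (the example data never hits that boundary).

```java
class SessionGapSketch {
    // Counts sessions in a sorted array of timestamps (ms): a new session
    // starts whenever the distance to the previous event exceeds gapMs.
    static int countSessions(long[] sortedTimestamps, long gapMs) {
        if (sortedTimestamps.length == 0) {
            return 0;
        }
        int sessions = 1;
        for (int i = 1; i < sortedTimestamps.length; i++) {
            if (sortedTimestamps[i] - sortedTimestamps[i - 1] > gapMs) {
                sessions++;
            }
        }
        return sessions;
    }

    public static void main(String[] args) {
        long gap = 10_000L;
        // Keyed by ClientID only: all four scans of client 1.
        System.out.println(countSessions(new long[] {10_000, 16_000, 22_000, 28_000}, gap));
        // Keyed by (ClientID, ProductID) = (1, 1): only events 1 and 3.
        System.out.println(countSessions(new long[] {10_000, 22_000}, gap));
    }
}
```

For the four tuples above this gives one session when keyed by ClientID alone and two sessions for the key (1, 1), which matches the behaviour described in the question.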

I am wondering if there is a way to solve this by applying a "second" key. First, the stream should be split on the key ClientID, and then a SessionWindow should be applied (irrespective of the product). After that, I would like to subdivide the ClientID-keyed SessionWindows using the second key (ProductID), effectively arriving at the same key as before (ClientID, ProductID) but without the previous issue. Then the aggregate could be applied normally to produce the expected output stream.

If that is not possible, is there any other way of solving this?

The simplest way to solve it would be to partition on ClientID only, which captures all scans made by a particular client, and then use process, which gives you access to all elements in a particular window; there you can generate separate events or outputs for every ProductID. Is there any reason why that might not work in your setup?
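A minimal sketch of that per-window step, written in plain Java so that it runs without a Flink dependency. It assumes the stream has been keyed by ClientID alone and windowed with the same 10-second session gap, and that the events of one window are handed over as an iterable, in the way Flink's `ProcessWindowFunction.process` receives them. The names `PerProductSketch` and `sumPerProduct`, and the use of `long[]` and `int[]` arrays in place of Tuple4 and Tuple3, are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerProductSketch {
    // Each input event is {timestampMs, clientId, productId, quantity}.
    // Each output row is {clientId, productId, summedQuantity}.
    static List<int[]> sumPerProduct(int clientId, Iterable<long[]> windowEvents) {
        // LinkedHashMap keeps products in order of first appearance.
        Map<Integer, Integer> quantityByProduct = new LinkedHashMap<>();
        for (long[] event : windowEvents) {
            int productId = (int) event[2];
            int quantity = (int) event[3];
            quantityByProduct.merge(productId, quantity, Integer::sum);
        }
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : quantityByProduct.entrySet()) {
            out.add(new int[] {clientId, entry.getKey(), entry.getValue()});
        }
        return out;
    }

    public static void main(String[] args) {
        // The four example events, all inside one session of client 1.
        List<long[]> window = Arrays.asList(
            new long[] {10_000, 1, 1, 1},
            new long[] {16_000, 1, 2, 1},
            new long[] {22_000, 1, 1, 1},
            new long[] {28_000, 1, 2, 1});
        for (int[] row : sumPerProduct(1, window)) {
            System.out.println("(" + row[0] + ", " + row[1] + ", " + row[2] + ")");
        }
        // prints (1, 1, 2) then (1, 2, 2)
    }
}
```

In an actual job the same loop would presumably sit inside the `process` method and emit each row through the collector instead of returning a list. Note that this approach buffers every element of the window until it fires, which is usually acceptable for short scanner sessions.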
