
How does window tumbling work in KSQL? The query returns the same results with or without window tumbling

I am using a KSQL stream and counting the events that arrive every 5 minutes. Here is my query:

select count(*), created_on_date from TABLE_NAME window tumbling (size 5 minutes) group by created_on_date;

It returns these results:

2 | 2018-11-13 09:54:50
3 | 2018-11-13 09:54:49
3 | 2018-11-13 09:54:52
3 | 2018-11-13 09:54:51
3 | 2018-11-13 09:54:50

The query without window tumbling:

select count(*), created_on_date from OP_UPDATE_ONLY group by created_on_date;

Result:

1 | 2018-11-13 09:55:08
2 | 2018-11-13 09:55:09
1 | 2018-11-13 09:55:10
3 | 2018-11-13 09:55:09
4 | 2018-11-13 09:55:12

Both queries return the same results, so what difference does window tumbling make?

A tumbling window aggregates over fixed-size, non-overlapping windows of time: it counts the number of events per key within each window. The window of time is based on the timestamp of your stream, which is inherited from your Kafka message by default but can be overridden with WITH (TIMESTAMP='my_column'). So you could pass created_on_date as the timestamp column and then aggregate on the time values it holds.
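To make the windowing concrete, here is a minimal sketch in plain Python (not KSQL) of how events could be assigned to epoch-aligned 5-minute tumbling windows and counted per key. The sample events, the keys "a" and "b", and the helper names window_start, tumbling_count and utc are all invented for illustration; this simulates the semantics only and is not how KSQL is implemented.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SIZE_S = 5 * 60  # mirrors SIZE 5 MINUTES in the query


def utc(text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC timestamp (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def window_start(ts: datetime, size_s: int = WINDOW_SIZE_S) -> datetime:
    """Start of the epoch-aligned tumbling window that contains ts."""
    epoch_s = int(ts.timestamp())
    return datetime.fromtimestamp(epoch_s - epoch_s % size_s, tz=timezone.utc)


def tumbling_count(events, size_s: int = WINDOW_SIZE_S) -> Counter:
    """Count (timestamp, key) events per (window start, key)."""
    return Counter((window_start(ts, size_s), key) for ts, key in events)


# Made-up events: (event timestamp, grouping key)
events = [
    (utc("2018-11-13 09:54:50"), "a"),
    (utc("2018-11-13 09:54:59"), "a"),
    (utc("2018-11-13 09:55:08"), "a"),
    (utc("2018-11-13 09:55:09"), "b"),
]

counts = tumbling_count(events)
for (start, key), n in sorted(counts.items()):
    print(start.strftime("%H:%M:%S"), key, n)
```

Each window covers a fixed 5-minute span, so the first two "a" events land in the window starting at 09:50:00 and the third in the one starting at 09:55:00, even though all three share a key.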

The second query aggregates over the entire stream of messages, with no time window. Since you happen to have a timestamp in the message itself, grouping by it gives the illusion of a time-based aggregation. However, if you wanted to find out how many events occurred within an hour, for example, this would be no use: you can only count at the grain of created_on_date.
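The point about grain can be illustrated with a small Python sketch (again a simulation, not KSQL). Grouping by the raw created_on_date value yields one count per distinct second, while an hourly count needs the events bucketed by time. The sample values and the hour_bucket helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Made-up created_on_date values, one per event
created_on_dates = [
    "2018-11-13 09:55:08",
    "2018-11-13 09:55:09",
    "2018-11-13 09:55:09",
    "2018-11-13 10:02:41",
]

# GROUP BY created_on_date without a window: one count per distinct value
per_value = Counter(created_on_dates)


def hour_bucket(text: str) -> datetime:
    """Truncate a timestamp string to the start of its hour."""
    return datetime.strptime(text, FMT).replace(minute=0, second=0)


# A one-hour grain requires bucketing by time, which is what a window provides
per_hour = Counter(hour_bucket(value) for value in created_on_dates)

print(dict(per_value))
print({bucket.strftime(FMT): n for bucket, n in per_hour.items()})
```

The per-value counts are as fine-grained as the column itself, so they cannot answer an hourly question without further summing; the bucketed counts answer it directly.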

So the first example, with a window, is usually the correct way to do it, because you usually want to answer a business question about an aggregation within a given time period, not over the course of an arbitrary stream of data.
