Stream Analytics: When using a TUMBLING WINDOW, is the start time of the window based on the earliest time in the stream or the start time of the job?
I have been reading the documentation on how the TUMBLINGWINDOW function is used along with the TIMESTAMP BY clause, and I can't find a clear explanation of how the start date is calculated for a query that contains a TUMBLING WINDOW and a TIMESTAMP BY field (I must have missed it if it is present somewhere).
Here are the links to the documentation which I have been looking at:
Below I quote the Time Consideration section from the TUMBLING WINDOW link (the primary source from which my question arose):
Time Consideration
"Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias."
It mentions a 5 minute window, but it doesn't go into detail about why the 5 minute windows start at this particular time and, most importantly, how this would generalise.
Note: I understand that this point might have been out of scope for this documentation, but I haven't managed to find a clear explanation of it elsewhere either.
Say I have the following code (copied from the docs with small modifications):
SELECT System.Timestamp() AS WindowEnd, TollId, COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(Duration(day, 1)), TollId
Up until now I have assumed that whichever field I choose in the TIMESTAMP BY clause (e.g. EntryTime in the snippet above) defines the field on which the window is created, and that the arguments passed to the TUMBLINGWINDOW function (e.g. one day in the snippet above) determine how that timestamp field is "windowed" or sliced. Stream Analytics would then create the windows based on the earliest dates present in the source time field when the job starts. For example, even if I start a job at 2022-02-22T09:00:00 UTC, if data is present for 2022-02-21 then the query would output the window 2022-02-21T00:00:00 UTC to 2022-02-22T00:00:00 UTC, since that window has already closed by this point, and the current window (2022-02-22T00:00:00 to 2022-02-23T00:00:00) would be output once it has finished.
From the documentation here: https://learn.microsoft.com/en-us/stream-analytics-query/windowing-azure-stream-analytics#understanding-windows
Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias. Every window automatically aligns itself to the zeroth hour. For example, a 5 minute tumbling window will align itself to (12:00-12:05], (12:05-12:10], ..., and so on.
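The alignment rule quoted above can be illustrated with a small Python sketch. This is not Stream Analytics code. The helper name `tumbling_window` and the choice of the Unix epoch as the alignment reference are assumptions made for illustration; for window sizes that divide a day evenly, this gives the same boundaries as aligning to midnight.

```python
from datetime import datetime, timedelta, timezone

# Assumed alignment reference (midnight UTC); hypothetical, for illustration only.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts: datetime, size: timedelta):
    """Return (start, end) of the aligned window (start, end] that contains ts."""
    offset = ts - EPOCH
    # Ceiling division: a timestamp exactly on a boundary belongs to the
    # window that ENDS there, matching the (start, end] convention.
    n = -(-offset // size)
    end = EPOCH + n * size
    return end - size, end


if __name__ == "__main__":
    five_min = timedelta(minutes=5)
    for hh, mm in [(12, 3), (12, 5), (12, 6)]:
        ts = datetime(2022, 2, 21, hh, mm, tzinfo=timezone.utc)
        start, end = tumbling_window(ts, five_min)
        print(ts.time(), "->", start.time(), end.time())
```

Under these assumptions, the window boundaries depend only on the window size and the fixed reference point, not on the earliest event in the stream or on the job start time.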
If you have historical data that you want to output, you can set a custom query start time, either to any point within the maximum retention of your streaming source (usually 7 days) or to the point at which the query was last stopped, so you don't lose any data during maintenance windows.
The query, however, will only output data with a timestamp that is after the query start time.
Therefore, imagine that your first data has a timestamp of 2022-02-20 01:23:00 and your second a timestamp of 2022-02-21 15:08:00. You start your streaming job as at 2022-02-21 14:00:00, so your 10 minute windows base themselves on the midnight of the 21st and then progress in 10 minute windows from there. The query does not output anything until the 15:00 - 15:10 window of the 21st, as this is the first window that is both after your query start time and contains data. In this scenario you can see how the windows work and why your data with the 2022-02-20 01:23:00 timestamp would not be output.
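The scenario above can be replayed with a rough Python simulation. This is only a sketch of the reasoning, not of how Stream Analytics is implemented: the function names `window_end` and `simulate` are hypothetical, timestamps are treated as naive UTC values, and late-arrival and out-of-order policies are ignored. Whether an event stamped exactly at the job start time is kept is also an assumption here.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed alignment reference (a midnight); hypothetical, for illustration only.
ORIGIN = datetime(1970, 1, 1)


def window_end(ts: datetime, size: timedelta) -> datetime:
    """End of the aligned (start, end] tumbling window that contains ts."""
    n = -(-(ts - ORIGIN) // size)  # ceiling division
    return ORIGIN + n * size


def simulate(events, job_start: datetime, size: timedelta):
    """Count events per window, ignoring events stamped before the job start."""
    counts = Counter(window_end(ts, size) for ts in events if ts >= job_start)
    return sorted(counts.items())


if __name__ == "__main__":
    events = [datetime(2022, 2, 20, 1, 23), datetime(2022, 2, 21, 15, 8)]
    job_start = datetime(2022, 2, 21, 14, 0)
    for end, count in simulate(events, job_start, timedelta(minutes=10)):
        print("window end:", end, "count:", count)
```

With these inputs the only output is the window ending at 2022-02-21 15:10:00 with a count of 1, and the event from 2022-02-20 is dropped because it precedes the job start.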