Stream Analytics: When using a TUMBLING WINDOW, is the window start time based on the earliest time in the stream or on the start time of the job?

Context

I have been reading the documentation on how the TUMBLINGWINDOW function is used together with the TIMESTAMP BY clause, and I can't find a clear explanation of how the start date is calculated for a query that contains a TUMBLING WINDOW and a TIMESTAMP BY field (I must have missed it if it is present somewhere).

Here are the links to the documentation which I have been looking at:

Below I quote the Time Consideration section from the TUMBLING WINDOW link (the primary source from which my question arose):

Time Consideration

"Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias."

It mentions a 5 minute window, but it doesn't go into detail about why the 5 minute windows start at this particular time or, most importantly, how this generalises.

Note: I understand that this point might be out of scope for that documentation, but I haven't managed to find a clear explanation of it elsewhere either.

Question(s)

Say I have the following code (copied from the docs with small modifications):

SELECT System.Timestamp() AS WindowEnd, TollId, COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(Duration(day, 1)), TollId

  • How is the start datetime of the window I create chosen if I have a stream of data?
    • Is it based on the earliest time within EntryTime (the field I select in TIMESTAMP BY), from which it logically forms an initial window to encompass that time, or does it depend on when I start my stream job (creating the windows from the time at which the job started onwards)?
  • If it depends on when I start the stream job, what is the best way to make sure the window starts so that it includes all the data wanted in the initial (and subsequent) windows? In my scenario I want aggregations by actual whole days, e.g. from 2022-02-22T00:00:00 to 2022-02-23T00:00:00.
    • Would it be to set the job start time to custom and have it start at the beginning of the day? E.g. I set the job to start at 2022-02-22T00:00:00 so that the first tumbling window (and subsequent ones) would include whole days of data, day by day, starting from 2022-02-22T00:00:00.

Thoughts

Until now I have assumed that whichever field I choose in the TIMESTAMP BY clause (e.g. EntryTime in the snippet above) defines the field on which the window is created, and that the TUMBLINGWINDOW arguments (e.g. day-wise in the snippet above) determine how that timestamp field is "windowed" or sliced. Stream Analytics would then create the windows based on the earliest dates present in the source time field when the job starts. For example, even if I start a job at 2022-02-22T09:00:00 UTC, if data is present for 2022-02-21 then the query would output that day's window (2022-02-21T00:00:00 UTC to 2022-02-22T00:00:00 UTC), since it has already closed by that point, and the current window (2022-02-22T00:00:00 to 2022-02-23T00:00:00) would be output once that window has finished.

From the documentation here: https://learn.microsoft.com/en-us/stream-analytics-query/windowing-azure-stream-analytics#understanding-windows

"Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias. Every window automatically aligns itself to the zeroth hour. For example, a 5 minute tumbling window will align itself to (12:00-12:05], (12:05-12:10], ..., and so on."
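The alignment rule quoted above can be illustrated with a small Python sketch. This is not Stream Analytics code: the function name tumbling_window and the choice of the Unix epoch (a midnight) as the alignment origin are assumptions made for illustration, and the sketch only models window sizes that divide a day evenly.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumption: "aligns itself to the zeroth hour" is modelled as alignment
# to multiples of the window size counted from a midnight (the Unix epoch).
EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts: datetime, size: timedelta) -> Tuple[datetime, datetime]:
    """Return the (start, end] bounds of the window that contains ts."""
    offset = (ts - EPOCH) % size      # time elapsed since the previous boundary
    if offset == timedelta(0):
        # ts lies exactly on a boundary; the end is inclusive, so ts
        # belongs to the window that ends at ts.
        return ts - size, ts
    start = ts - offset
    return start, start + size


# 5 minute windows: an event at 00:03 falls into (00:00, 00:05].
five_min = tumbling_window(datetime(2022, 2, 22, 0, 3), timedelta(minutes=5))

# 1 day windows: an event at 09:00 on 2022-02-22 falls into the window
# running from midnight to midnight, whatever time the job was started.
one_day = tumbling_window(datetime(2022, 2, 22, 9, 0), timedelta(days=1))
```

In this model the window boundaries depend only on the window size and the event timestamp, which is why a day-wise window covers whole calendar days without any special job start time.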

If you have historical data that you want to output, you can set a custom query start time, either to any point within the maximum cache of your streaming source (usually 7 days) or to the point at which the query was last stopped, so that you don't lose any data during maintenance windows.

The query, however, will only output data with a timestamp that is after the query start time.

Therefore, imagine that your first event has a timestamp of 2022-02-20 01:23:00 and your second a timestamp of 2022-02-21 15:08:00. You start your streaming job as at 2022-02-21 14:00:00 with 10 minute windows, so the windows are based on midnight of the 21st and progress in 10 minute steps from there. The query does not output anything until the 15:00 - 15:10 window of the 21st, as this is the first window that is both after your query start time and contains data. This scenario shows how the windows work and why the event with the 2022-02-20 01:23:00 timestamp would not be output.
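The scenario above can be reproduced with a rough Python model. It is only a sketch of the behaviour described in this answer, not of the Stream Analytics engine itself: the helper names window_end and emitted_windows are hypothetical, windows are assumed to align to midnight, and a window is assumed to be output only if its end time is after the job start time.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # a midnight, used as the assumed alignment origin


def window_end(ts, size):
    """End time of the (start, end] tumbling window that contains ts."""
    offset = (ts - EPOCH) % size
    return ts if offset == timedelta(0) else ts - offset + size


def emitted_windows(events, job_start, size):
    """Count events per window, keeping only windows ending after job_start."""
    counts = Counter(window_end(ts, size) for ts in events)
    return {end: n for end, n in sorted(counts.items()) if end > job_start}


events = [
    datetime(2022, 2, 20, 1, 23),   # window ends 2022-02-20 01:30, before the job start
    datetime(2022, 2, 21, 15, 8),   # window ends 2022-02-21 15:10, after the job start
]
result = emitted_windows(
    events,
    job_start=datetime(2022, 2, 21, 14, 0),
    size=timedelta(minutes=10),
)
```

Here result holds a single entry, keyed by the window end 2022-02-21 15:10 with a count of 1. The event from the 20th is missing because its window closed before the job start time, not because the windows were shifted by it.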
