简体   繁体   中英

Stream Analytics:When using a TUMBLING WINDOW is the start time of the window start based on the earliest time in the stream or start time of the job?

Context

I have been reading documentation on how the TUMBLINGWINDOW function is used along with the TIMSTAMP BY clause and can't seem to find a clear explanation on how the start date of a query which contains a TUMBLING WINDOW and TIMESTAMP BY field is calculated (must have missed it if it is present somewhere).

Here are the links to the documentation which I have been looking at:

I am quoting below the Time Consideration section in the TUMBLING WINDOW LINK (which is the primary source from which my question has arose)

Time Consideration

"Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias."

It mentions a 5 minute window however doesn't seem to go into detail about why the 5 minute windows are started at this time and most importantly how this would generalise.

Note: I understand that this point might have been out of scope for this documentation but I haven't managed to find a clear explanation of this elsewhere either.

Question(s)

Say I have the following code (copied from docs with small modifications)

SELECT System.Timestamp() AS WindowEnd, TollId, COUNT(*)  
FROM Input TIMESTAMP BY EntryTime  
GROUP BY TumblingWindow(Duration(day, 1)), TollId
  • How is the start datetime of the window I create selected/chosen if I have a stream of data?
    • Is it based on the earliest time within the EntryTime (which is what I am selecting as the timestamp by field) from which it then logically forms an initial window to encompass or is it dependent on when I start my stream job running (creating the windows from the time at which the job has started and after)?
  • In the scenario that it is dependent on when I start the stream job what's the best way to make sure I start the window so that it includes all data wanted in the initial (and subsequent) windows (in my scenario aggregations by actual whole days eg start of 2022-02-22 T00:00:00 to 2022-02-23 T00:00:00)?
    • Would it be to set the start time to custom and select it to start on the beginning of the day? eg I set the job to start on 2022-02-22 T00:00:00 that way it will start the window at this time so that the first tumbling window (and subsequent ones) would include whole days of data starting from 2022-02-22 T00:00:00 on a days wise basis.

Thoughts

Up until now I have been on the assumption that whatever field I choose in the TIMESTAMP BY clause (eg EntryTime in the above code snippet) would define the field on which the window is created and then depending on the TUMBLINGWINDOW function arguments chosen (eg day wise in the above code snippet) would handle how the chosen timestamp field is "windowed" or sliced. Stream Analytics would then handle the window creation based on the earliest dates present in the source time field at the time of job starting (eg even if I start a job at 2022-02-22 T09:00:00 UTC if the data is present for the day 2022-02-21 then then the query would output for that day of data 2022-02-21T00:00:00 UTC to 2022-02-22T00:00:00 UTC since that would have passed by this point and the current window (2022-02-22T00:00:00 to 2022-02-23T00:00:00) would populate once that window is finished.

From the documentation here: https://learn.microsoft.com/en-us/stream-analytics-query/windowing-azure-stream-analytics#understanding-windows

Every window operation outputs event at the end of the window. The windows of Azure Stream Analytics are opened at the window start time and closed at the window end time. For example, if you have a 5 minute window from 12:00 AM to 12:05 AM all events with timestamp greater than 12:00 AM and up to timestamp 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp() property using an alias. Every window automatically aligns itself to the zeroth hour. For example, a 5 minute tumbling window will align itself to (12:00-12:05], (12:05-12:10], ..., and so on.

If you have historical data that you want to output, you can set a custom query start time either as any point up to the max cache of your streaming source (ususally 7 days) or as at the point the query was last stopped, so you don't lose any data during maintenance windows.

The query, however, will only output data with a timestamp that is after the query start time.

Therefore, imagine that your first data has a timestamp of 2022-02-20 01:23:00 and your second a timestamp of 2022-02-21 15:08:00 . You start your streaming job as at 2022-02-21 14:00:00 , so your 10 minute windows base themselves on the midnight of the 21st and then progress in 10 minute windows from there. The query does not output anything until the 15:00 - 15:10 window of the 21st, as this is the first window that is both post your query start time and contains data. In this scenario you can see how the windows work and why your data with the 2022-02-20 01:23:00 timestamp would not be output.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM