简体   繁体   中英

Kusto query to cluster time-series data into 'sessions' and assign sessionId

I have a time-series data of below format:

datatable(Element: string, Timestamp:datetime, Value:long)

For each Element there is a Timestamp and associated Value for the timestamp. If 2 consecuitive timestamps for a element is more then X-min apart, they are considered to be part of different sessions (The smaller timestamp being end of previous session and bigger one is starting of a new session). For each such session I want to calculate the SessionId (based on session start or a random guid), session start and session end.

Example: (Considering a timestamp with gap of 30-min from previous is considered as starting of a new session)

Input:

 Element           Timestamp       Value
Element-A   2022-03-25 06:15:00     10
Element-A   2022-03-25 06:30:00     10
Element-A   2022-03-25 06:45:00     10
Element-A   2022-03-25 08:15:00     10
Element-A   2022-03-25 08:30:00     10
Element-A   2022-03-25 08:45:00     10
Element-B   2022-03-25 07:15:00     10
Element-B   2022-03-25 07:30:00     10
Element-B   2022-03-25 07:45:00     10
Element-B   2022-03-25 09:15:00     10
Element-B   2022-03-25 09:30:00     10
Element-B   2022-03-25 09:45:00     10

Expected output:

 Element           Timestamp       value    SessionId           SessionStart             SessionEnd
Element-A   2022-03-25 06:15:00     10        guid-1        2022-03-25 06:15:00      2022-03-25 06:45:00
Element-A   2022-03-25 06:30:00     10        guid-1        2022-03-25 06:15:00      2022-03-25 06:45:00
Element-A   2022-03-25 06:45:00     10        guid-1        2022-03-25 06:15:00      2022-03-25 06:45:00
Element-A   2022-03-25 08:15:00     10        guid-2        2022-03-25 08:15:00      2022-03-25 08:45:00
Element-A   2022-03-25 08:30:00     10        guid-2        2022-03-25 08:15:00      2022-03-25 08:45:00
Element-A   2022-03-25 08:45:00     10        guid-2        2022-03-25 08:15:00      2022-03-25 08:45:00
Element-B   2022-03-25 07:15:00     10        guid-3        2022-03-25 07:15:00      2022-03-25 07:45:00
Element-B   2022-03-25 07:30:00     10        guid-3        2022-03-25 07:15:00      2022-03-25 07:45:00
Element-B   2022-03-25 07:45:00     10        guid-3        2022-03-25 07:15:00      2022-03-25 07:45:00
Element-B   2022-03-25 09:15:00     10        guid-4        2022-03-25 09:15:00      2022-03-25 09:45:00
Element-B   2022-03-25 09:30:00     10        guid-4        2022-03-25 09:15:00      2022-03-25 09:45:00
Element-B   2022-03-25 09:45:00     10        guid-4        2022-03-25 09:15:00      2022-03-25 09:45:00

The data volume is high. Please suggest with perf-efficient queries to achieve this.

Per the OP comments, adding a solution with only the summarization part. Please note that -

  1. the combination of Element & SessionIndex is unique and can be used interchangeably with SessionId (based on new_guid() )
  2. Since this solution is based on summarization, additional info can easily be collected per session, such as number of events per session, min/max/avg value per session, number of events with value higher than x (based on count_if ) etc.

datatable (Element:string, Timestamp:datetime, Value:int)
[
     "Element-A" ,"2022-03-25 06:15:00" ,10
    ,"Element-A" ,"2022-03-25 06:30:00" ,10
    ,"Element-A" ,"2022-03-25 06:45:00" ,10
    ,"Element-A" ,"2022-03-25 08:15:00" ,10
    ,"Element-A" ,"2022-03-25 08:30:00" ,10
    ,"Element-A" ,"2022-03-25 08:45:00" ,10
    ,"Element-B" ,"2022-03-25 07:15:00" ,10
    ,"Element-B" ,"2022-03-25 07:30:00" ,10
    ,"Element-B" ,"2022-03-25 07:45:00" ,10
    ,"Element-B" ,"2022-03-25 09:15:00" ,10
    ,"Element-B" ,"2022-03-25 09:30:00" ,10
    ,"Element-B" ,"2022-03-25 09:45:00" ,10
]
| partition hint.strategy=shuffle by Element
(
    order by Timestamp asc
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    | summarize SessionStart = min(Timestamp), SessionEnd = max(Timestamp) by SessionIndex
    | extend Element, SessionId = new_guid()
    | project-reorder Element
)

Fiddle

Element SessionIndex SessionStart SessionEnd SessionId
Element-A 0 2022-03-25T06:15:00Z 2022-03-25T06:45:00Z 5d43e356-9aae-40cb-9e2e-bd2741cc9934
Element-B 0 2022-03-25T07:15:00Z 2022-03-25T07:45:00Z df83db35-c292-4bee-a14e-0ebc2b7ef6b5
Element-A 1 2022-03-25T08:15:00Z 2022-03-25T08:45:00Z 40dbaa02-b110-4e99-8696-2505a2995553
Element-B 1 2022-03-25T09:15:00Z 2022-03-25T09:45:00Z 59d6fdeb-a596-4fab-97e5-d9057519c6c0

You can start with this.
The demographics of your data (number of records, number of elements, number of sessions per element) will determine how optimized this solution is for your specific needs.

datatable (Element:string, Timestamp:datetime, Value:int)
[
     "Element-A" ,"2022-03-25 06:15:00" ,10
    ,"Element-A" ,"2022-03-25 06:30:00" ,10
    ,"Element-A" ,"2022-03-25 06:45:00" ,10
    ,"Element-A" ,"2022-03-25 08:15:00" ,10
    ,"Element-A" ,"2022-03-25 08:30:00" ,10
    ,"Element-A" ,"2022-03-25 08:45:00" ,10
    ,"Element-B" ,"2022-03-25 07:15:00" ,10
    ,"Element-B" ,"2022-03-25 07:30:00" ,10
    ,"Element-B" ,"2022-03-25 07:45:00" ,10
    ,"Element-B" ,"2022-03-25 09:15:00" ,10
    ,"Element-B" ,"2022-03-25 09:30:00" ,10
    ,"Element-B" ,"2022-03-25 09:45:00" ,10
]
| partition hint.strategy=shuffle by Element
(
    order by Timestamp asc
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    | summarize min(Timestamp), max(Timestamp), make_list(Timestamp), make_list(Value) by SessionIndex
    | extend SessionId = new_guid()
    | mv-apply Timestamp = list_Timestamp to typeof(datetime), Value = list_Value to typeof(int) on (project Timestamp, Value)
    | project Element, Timestamp, Value, SessionStart = min_Timestamp, SessionEnd = max_Timestamp, SessionId, SessionIndex
)
Element Timestamp Value SessionStart SessionEnd SessionId SessionIndex
Element-A 2022-03-25T06:15:00Z 10 2022-03-25T06:15:00Z 2022-03-25T06:45:00Z 1ac146b1-24fa-427e-b2b3-663d83297d4c 0
Element-A 2022-03-25T06:30:00Z 10 2022-03-25T06:15:00Z 2022-03-25T06:45:00Z 1ac146b1-24fa-427e-b2b3-663d83297d4c 0
Element-A 2022-03-25T06:45:00Z 10 2022-03-25T06:15:00Z 2022-03-25T06:45:00Z 1ac146b1-24fa-427e-b2b3-663d83297d4c 0
Element-B 2022-03-25T07:15:00Z 10 2022-03-25T07:15:00Z 2022-03-25T07:45:00Z cbef109a-73bc-4067-9e7f-ebada6aa444e 0
Element-B 2022-03-25T07:30:00Z 10 2022-03-25T07:15:00Z 2022-03-25T07:45:00Z cbef109a-73bc-4067-9e7f-ebada6aa444e 0
Element-B 2022-03-25T07:45:00Z 10 2022-03-25T07:15:00Z 2022-03-25T07:45:00Z cbef109a-73bc-4067-9e7f-ebada6aa444e 0
Element-A 2022-03-25T08:15:00Z 10 2022-03-25T08:15:00Z 2022-03-25T08:45:00Z c53fba2e-b82e-418c-9380-1e732be8fcb5 1
Element-A 2022-03-25T08:30:00Z 10 2022-03-25T08:15:00Z 2022-03-25T08:45:00Z c53fba2e-b82e-418c-9380-1e732be8fcb5 1
Element-A 2022-03-25T08:45:00Z 10 2022-03-25T08:15:00Z 2022-03-25T08:45:00Z c53fba2e-b82e-418c-9380-1e732be8fcb5 1
Element-B 2022-03-25T09:15:00Z 10 2022-03-25T09:15:00Z 2022-03-25T09:45:00Z 4ab89211-4378-45d3-8ac7-a570942e2807 1
Element-B 2022-03-25T09:30:00Z 10 2022-03-25T09:15:00Z 2022-03-25T09:45:00Z 4ab89211-4378-45d3-8ac7-a570942e2807 1
Element-B 2022-03-25T09:45:00Z 10 2022-03-25T09:15:00Z 2022-03-25T09:45:00Z 4ab89211-4378-45d3-8ac7-a570942e2807 1

Fiddle

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM