Snowflake Analytical Query Design
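I have a tricky query design requirement. I have tried different types and combinations of analytical functions to get my result from the data set below. My other plan is to write a stored procedure, but I wanted to reach out to the experts before changing direction.
Input data set:
Required output data set with the group column: when the session id changes and I later get the same session id back again, it must be given a different group. I tried LEAD/LAG combinations but could not get the required output below; one case or another keeps breaking.
Thanks!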
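SQL is expressive enough to provide a declarative solution to complex requirements.
Snowflake recently implemented the SQL:2016 standard clause MATCH_RECOGNIZE, which was designed to solve such cases in a very direct way.
In some cases, you might need to identify sequences of table rows that match a pattern. For example, you might need to:
Determine which users followed a specific sequence of pages and actions on your website before opening a support ticket or making a purchase.
Find the stocks whose prices followed a V-shaped or W-shaped recovery over a period of time.
Look for patterns in sensor data that might indicate an upcoming system failure.
Data preparation: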
CREATE OR REPLACE TABLE t
AS
SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:30:00'::DATE AS Trans_dt, 1 AS VERSION_ID
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:35:00'::DATE AS Trans_dt, 2
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:37:00'::DATE AS Trans_dt, 3
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:38:00'::DATE AS Trans_dt, 4
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:39:00'::DATE AS Trans_dt, 5
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:40:00'::DATE AS Trans_dt, 6;
Query:
SELECT *
FROM t
MATCH_RECOGNIZE (
PARTITION BY POL_ID
ORDER BY VERSION_ID
MEASURES MATCH_NUMBER() AS group_id
--,CLASSIFIER() as cks
ALL ROWS PER MATCH
PATTERN (a+b*)
DEFINE a as sess_id = FIRST_VALUE(sess_id)
,b AS sess_id != FIRST_VALUE(sess_id)
) mr
ORDER BY POL_ID, VERSION_ID;
Output:
SESS_ID POL_ID TRANS_DT VERSION_ID GROUP_ID
101 1 2021-04-17 1 1
101 1 2021-04-17 2 1
102 1 2021-04-17 3 1
102 1 2021-04-17 4 1
101 1 2021-04-17 5 2
101 1 2021-04-17 6 2
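To see why the query produces these group numbers, the matching can be traced by hand. The following is a minimal Python sketch, not Snowflake's implementation: it assumes the rows of one POL_ID are already sorted by VERSION_ID and that the pattern is matched greedily, and the function name `match_numbers` is made up for illustration.

```python
# Emulates PATTERN (a+b*) with
#   a: sess_id == first sess_id of the current match
#   b: sess_id != first sess_id of the current match
# Input: sess_id values of one POL_ID, already sorted by VERSION_ID.

def match_numbers(sess_ids):
    """Return the match number each row would receive."""
    groups = []
    match_no = 0
    i = 0
    n = len(sess_ids)
    while i < n:
        match_no += 1
        first = sess_ids[i]
        # a+ : consume rows that equal the first sess_id of this match
        while i < n and sess_ids[i] == first:
            groups.append(match_no)
            i += 1
        # b* : consume rows that differ from it
        while i < n and sess_ids[i] != first:
            groups.append(match_no)
            i += 1
        # the next row equals `first` again, so a new match starts there
    return groups


print(match_numbers([101, 101, 102, 102, 101, 101]))  # [1, 1, 1, 1, 2, 2]
```

A match only ends when the first session id of that match shows up again, which is why 101 and 102 share group 1 and the returning 101 opens group 2.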
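How it works:
(a+b*) is a Perl-style regular expression: a (one or more rows) followed by b (zero or more rows). Per the DEFINE clause, a matches rows whose sess_id equals the first sess_id of the match, and b matches rows whose sess_id differs from it.
MATCH_NUMBER() "returns the sequential number of the match".
The matching is done per POL_ID, with VERSION_ID as the ordering column.
In what follows, it is not obvious how you want group_id to relate to pol_id, so I have ignored it.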
So a CTE is used just to fake the data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6)
v(sess_id, pol_id, trans_dt, version_id)
)
Then I would like to write these operations:
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM data
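r1 numbers all rows by trans_dt and r2 numbers the rows within each sess_id, so their difference r3 stays constant across a consecutive run of one sess_id and jumps when that sess_id comes back after a gap. Comparing r3 with lag_r3 (the previous r3 of the same sess_id) flags those jumps. These are the edges you want to count, hence the SUM, which starts from zero, so the +1 gives the value you want.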
The above is not valid in Snowflake as written, so it needs to be layered to work:
SELECT
*
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
,IFF(lag_r3 != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
,r1- r2 as r3
FROM data
)
)
ORDER BY trans_dt;
This gives:
SESS_ID  POL_ID  TRANS_DT             VERSION_ID  R1  R2  R3  LAG_R3  SESS_EDGE  GROUP_ID
101      1       2021-04-17 09:30:00  1           1   1   0   null    0          1
101      1       2021-04-17 09:35:00  2           2   2   0   0       0          1
102      1       2021-04-17 09:37:00  3           3   1   2   null    0          1
102      1       2021-04-17 09:38:00  4           4   2   2   2       0          1
101      1       2021-04-17 09:39:00  5           5   3   2   0       1          2
101      1       2021-04-17 09:40:00  6           6   4   2   2       0          2
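The same arithmetic can be traced outside the database. The sketch below recomputes r1, r2, r3, lag_r3, sess_edge and the group for the six rows in plain Python. It is only an illustration of the layered query, assuming the rows arrive sorted by trans_dt; the helper name `trace` is hypothetical.

```python
from collections import defaultdict

# (sess_id, trans_dt) pairs, already in trans_dt order
rows = [
    (101, "09:30"), (101, "09:35"),
    (102, "09:37"), (102, "09:38"),
    (101, "09:39"), (101, "09:40"),
]


def trace(rows):
    seen = defaultdict(int)  # rows seen so far per sess_id, gives r2
    prev_r3 = {}             # last r3 per sess_id, gives lag_r3
    running = 0              # running SUM(sess_edge)
    out = []
    for r1, (sess_id, _) in enumerate(rows, start=1):
        seen[sess_id] += 1
        r2 = seen[sess_id]
        r3 = r1 - r2
        lag_r3 = prev_r3.get(sess_id)  # None plays the role of SQL NULL
        # a NULL comparison is not true, so the first row of a session is no edge
        sess_edge = 1 if lag_r3 is not None and lag_r3 != r3 else 0
        prev_r3[sess_id] = r3
        running += sess_edge
        out.append((sess_id, r1, r2, r3, lag_r3, sess_edge, running + 1))
    return out


for row in trace(rows):
    print(row)  # last element is the group: 1, 1, 1, 1, 2, 2
```

The only edge is the fifth row, where session 101 returns with r3 = 2 after having had r3 = 0.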
So you can see how it works. It can then be compressed to:
SELECT
sess_id
,pol_id
,trans_dt
,version_id
,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (
SELECT
*
,IFF(LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) != r3, 1, 0) as sess_edge
FROM (
SELECT *
,ROW_NUMBER() OVER (ORDER BY trans_dt)- ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) as r3
FROM data
)
)
ORDER BY trans_dt;
This is much more complex than Gordon's answer, which, rewritten in the same form as mine, is:
select *
,sum(edge) over ( partition by pol_id, sess_id order by trans_dt ) as grouping
from (
select *
,lag(sess_id) over (partition by pol_id order by trans_dt) as prev_session_id
,iff(prev_session_id = sess_id, 0, 1) AS edge
from data
)
ORDER BY 2,3;
That is rather clever, since it SUMs the edges per sess_id.
But if you add extra data:
WITH data AS (
SELECT * FROM VALUES
(101, 1, '2021-04-17 09:30:00', 1),
(101, 1, '2021-04-17 09:35:00', 2),
(102, 1, '2021-04-17 09:37:00', 3),
(102, 1, '2021-04-17 09:38:00', 4),
(101, 1, '2021-04-17 09:39:00', 5),
(101, 1, '2021-04-17 09:40:00', 6),
(102, 1, '2021-04-17 09:41:00', 7),
(102, 1, '2021-04-17 09:42:00', 8),
(103, 1, '2021-04-17 09:43:00', 9),
(103, 1, '2021-04-17 09:44:00', 10)
v(sess_id, pol_id, trans_dt, VERSION_ID)
)
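Gordon's answer assigns the last two rows (session 103) to group 1, whereas my query assigns them 3 and Lukasz's assigns them 2. Which of these is right depends on the result you expect.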
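The difference on the extended data can be checked without a Snowflake account. The sketch below runs portable versions of the two window-function queries in SQLite from Python. It rests on several assumptions: an SQLite build with window functions (3.25 or later), CASE standing in for Snowflake's IFF, and a real table named data in place of the CTE. MATCH_RECOGNIZE is left out because SQLite does not have it.

```python
import sqlite3

rows = [
    (101, 1, "2021-04-17 09:30:00", 1), (101, 1, "2021-04-17 09:35:00", 2),
    (102, 1, "2021-04-17 09:37:00", 3), (102, 1, "2021-04-17 09:38:00", 4),
    (101, 1, "2021-04-17 09:39:00", 5), (101, 1, "2021-04-17 09:40:00", 6),
    (102, 1, "2021-04-17 09:41:00", 7), (102, 1, "2021-04-17 09:42:00", 8),
    (103, 1, "2021-04-17 09:43:00", 9), (103, 1, "2021-04-17 09:44:00", 10),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (sess_id INT, pol_id INT, trans_dt TEXT, version_id INT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# lag() edge, summed per (pol_id, sess_id)
LAG_PER_SESSION = """
SELECT version_id,
       SUM(edge) OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) AS grp
FROM (
    SELECT *,
           CASE WHEN LAG(sess_id) OVER (PARTITION BY pol_id ORDER BY trans_dt) = sess_id
                THEN 0 ELSE 1 END AS edge
    FROM data
)
ORDER BY version_id
"""

# row-number difference, edges summed over all rows
ROW_NUMBER_DIFF = """
SELECT version_id,
       SUM(sess_edge) OVER (ORDER BY trans_dt) + 1 AS grp
FROM (
    SELECT *,
           CASE WHEN LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt) != r3
                THEN 1 ELSE 0 END AS sess_edge
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (ORDER BY trans_dt)
             - ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r3
        FROM data
    )
)
ORDER BY version_id
"""


def groups(sql):
    """Run a query and return the group of each row in version_id order."""
    return [grp for _, grp in con.execute(sql)]


lag_groups = groups(LAG_PER_SESSION)
diff_groups = groups(ROW_NUMBER_DIFF)
print(lag_groups)   # [1, 1, 1, 1, 2, 2, 2, 2, 1, 1]
print(diff_groups)  # [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
```

The per-session sum restarts at 1 for every new sess_id, while the row-number variant keeps one running counter across all sessions.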
Also, what do you want to happen when pol_id changes: should the group be a global count, or should the second pol start again at 1?
Basically, you want to use lag() to see when the session id changes. Then you want a cumulative sum, but only within each session id:
select t.*,
sum(case when prev_session_id = session_id then 0 else 1 end) over (
partition by pol_id, session_id
order by trans_dt
) as grouping
from (select t.*,
lag(session_id) over (partition by pol_id order by trans_dt) as prev_session_id
from t
) t;
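This is a tricky variant of the gaps-and-islands problem. The more usual case would be to enumerate the three pairs of rows as 1, 2, and 3. For that, you would just remove session_id from the partition by in the sum().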
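The effect of the partition by choice can be sketched in a few lines of Python. This is an illustration of the lag() plus cumulative-sum idea, not the SQL itself: the function name and the `per_session` flag are made up, and the rows are assumed to be sorted by trans_dt with no NULL session ids.

```python
def lag_edge_groups(rows, per_session=True):
    """rows: (pol_id, sess_id) tuples sorted by trans_dt.

    per_session=True  mimics partition by pol_id, session_id
    per_session=False mimics partition by pol_id only
    """
    prev_sess = {}  # last sess_id seen per pol_id, i.e. lag(session_id)
    totals = {}     # running sum of edges per partition
    out = []
    for pol_id, sess_id in rows:
        # edge = 1 when the session differs from the previous row (or there is none)
        edge = 0 if prev_sess.get(pol_id) == sess_id else 1
        prev_sess[pol_id] = sess_id
        key = (pol_id, sess_id) if per_session else pol_id
        totals[key] = totals.get(key, 0) + edge
        out.append(totals[key])
    return out


rows = [(1, 101), (1, 101), (1, 102), (1, 102), (1, 101), (1, 101)]
print(lag_edge_groups(rows, per_session=True))   # [1, 1, 1, 1, 2, 2]
print(lag_edge_groups(rows, per_session=False))  # [1, 1, 2, 2, 3, 3]
```

With the session id in the partition, each session counts its own returns; without it, every change of session starts a new island.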