Select with window function (dense_rank()) in SparkSQL
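I have a table of customer purchase records, and I need to label each purchase with the datetime window it falls into. One window is 8 days long. So if I make a purchase today and another one 5 days later, both purchases belong to window 1; but if I make a purchase today and the next one 8 days later, the first purchase is in window 1 and the last purchase is in window 2.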
create temporary table transactions
(client_id int,
transaction_ts datetime,
store_id int);
insert into transactions values
(1,'2018-06-01 12:17:37', 1),
(1,'2018-06-02 13:17:37', 2),
(1,'2018-06-03 14:17:37', 3),
(1,'2018-06-09 10:17:37', 2),
(2,'2018-06-02 10:17:37', 1),
(2,'2018-06-02 13:17:37', 2),
(2,'2018-06-08 14:19:37', 3),
(2,'2018-06-16 13:17:37', 2),
(2,'2018-06-17 14:17:37', 3);
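The window is 8 days. The problem is that I do not understand how to make dense_rank() OVER (PARTITION BY ...) look at the datetime and build 8-day windows. As a result I need something like this: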
1,'2018-06-01 12:17:37', 1,1
1,'2018-06-02 13:17:37', 2,1
1,'2018-06-03 14:17:37', 3,1
1,'2018-06-09 10:17:37', 2,2
2,'2018-06-02 10:17:37', 1,1
2,'2018-06-02 13:17:37', 2,1
2,'2018-06-08 14:19:37', 3,2
2,'2018-06-16 13:17:37', 2,3
2,'2018-06-17 14:17:37', 3,3
Any idea how to get this? I can run it in MySQL or Spark SQL, but MySQL does not support PARTITION BY. I still cannot find a solution. Any help is appreciated.
You can most likely solve this in Spark SQL by combining the time window function with a partitioned window function:
val purchases = Seq((1,"2018-06-01 12:17:37", 1), (1,"2018-06-02 13:17:37", 2), (1,"2018-06-03 14:17:37", 3), (1,"2018-06-09 10:17:37", 2), (2,"2018-06-02 10:17:37", 1), (2,"2018-06-02 13:17:37", 2), (2,"2018-06-08 14:19:37", 3), (2,"2018-06-16 13:17:37", 2), (2,"2018-06-17 14:17:37", 3)).toDF("client_id", "transaction_ts", "store_id")
purchases.show(false)
+---------+-------------------+--------+
|client_id|transaction_ts |store_id|
+---------+-------------------+--------+
|1 |2018-06-01 12:17:37|1 |
|1 |2018-06-02 13:17:37|2 |
|1 |2018-06-03 14:17:37|3 |
|1 |2018-06-09 10:17:37|2 |
|2 |2018-06-02 10:17:37|1 |
|2 |2018-06-02 13:17:37|2 |
|2 |2018-06-08 14:19:37|3 |
|2 |2018-06-16 13:17:37|2 |
|2 |2018-06-17 14:17:37|3 |
+---------+-------------------+--------+
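import org.apache.spark.sql.expressions.Window
// Bucket each client's purchases into fixed 8-day time windows.
val groupedByTimeWindow = purchases.groupBy($"client_id", window($"transaction_ts", "8 days")).agg(collect_list("transaction_ts").as("transaction_tss"), collect_list("store_id").as("store_ids"))
// windowByClient was not defined in the snippet as posted; this definition is an assumption
// that numbers each client's buckets by start time, consistent with the output below.
val windowByClient = Window.partitionBy($"client_id").orderBy($"window.start")
val withWindowNumber = groupedByTimeWindow.withColumn("window_number", row_number().over(windowByClient))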
withWindowNumber.orderBy("client_id", "window.start").show(false)
+---------+---------------------------------------------+---------------------------------------------------------------+---------+-------------+
|client_id|window |transaction_tss |store_ids|window_number|
+---------+---------------------------------------------+---------------------------------------------------------------+---------+-------------+
|1 |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|[2018-06-01 12:17:37, 2018-06-02 13:17:37, 2018-06-03 14:17:37]|[1, 2, 3]|1 |
|1 |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|[2018-06-09 10:17:37] |[2] |2 |
|2 |[2018-05-28 17:00:00.0,2018-06-05 17:00:00.0]|[2018-06-02 10:17:37, 2018-06-02 13:17:37] |[1, 2] |1 |
|2 |[2018-06-05 17:00:00.0,2018-06-13 17:00:00.0]|[2018-06-08 14:19:37] |[3] |2 |
|2 |[2018-06-13 17:00:00.0,2018-06-21 17:00:00.0]|[2018-06-16 13:17:37, 2018-06-17 14:17:37] |[2, 3] |3 |
+---------+---------------------------------------------+---------------------------------------------------------------+---------+-------------+
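The window boundaries in the output above do not start at each client's first purchase. By default, Spark's `window` function lays out tumbling windows on a fixed grid aligned to 1970-01-01 00:00:00 UTC, and the 17:00:00 boundaries shown are consistent with that grid rendered in a UTC-7 session time zone (an inference, not something the answer states). The sketch below reproduces the grid arithmetic in plain Scala. `TumblingWindowSketch` and `windowStart` are hypothetical names used only for illustration, not Spark APIs, and the sample timestamps are treated as UTC for simplicity.

```scala
import java.time.{Duration, Instant}

object TumblingWindowSketch {
  // Hypothetical helper (not a Spark API): start of the epoch-aligned
  // tumbling window of length `size` that contains `ts`.
  def windowStart(ts: Instant, size: Duration): Instant = {
    val sizeSec = size.getSeconds
    // floorDiv rounds toward negative infinity, so the grid stays correct
    // for timestamps before the epoch as well.
    Instant.ofEpochSecond(Math.floorDiv(ts.getEpochSecond, sizeSec) * sizeSec)
  }

  def main(args: Array[String]): Unit = {
    val eightDays = Duration.ofDays(8)
    // Two of client 1's purchases, interpreted as UTC (an assumption).
    val samples = Seq("2018-06-03T14:17:37Z", "2018-06-09T10:17:37Z")
    samples.foreach { s =>
      println(s"$s -> window starting ${windowStart(Instant.parse(s), eightDays)}")
    }
  }
}
```

Because the grid is fixed, two purchases less than 8 days apart can still fall into different buckets, as the 2018-06-03 and 2018-06-09 rows do. If the windows need to be shifted, `window` also has an overload that takes a `startTime` offset; anchoring the windows at each client's first purchase would need different logic than what is shown here.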
If needed, you can explode the list elements from store_ids or transaction_tss.
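As an alternative to collecting lists and exploding them again, the window number can be attached to every purchase row directly, which is closer to the result format requested in the question. The following is a minimal sketch under the same assumption as the answer (fixed, epoch-aligned 8-day buckets); the object name `PerRowWindowNumber`, the helpers `samplePurchases` and `numberWindows`, and the temporary `bucket` column are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, window}

object PerRowWindowNumber {
  // Sample rows copied from the question.
  def samplePurchases(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "2018-06-01 12:17:37", 1), (1, "2018-06-02 13:17:37", 2),
      (1, "2018-06-03 14:17:37", 3), (1, "2018-06-09 10:17:37", 2),
      (2, "2018-06-02 10:17:37", 1), (2, "2018-06-02 13:17:37", 2),
      (2, "2018-06-08 14:19:37", 3), (2, "2018-06-16 13:17:37", 2),
      (2, "2018-06-17 14:17:37", 3)
    ).toDF("client_id", "transaction_ts", "store_id")
  }

  // Hypothetical helper: tags each row with its 8-day bucket, then ranks the
  // buckets per client so rows in the same bucket share one window_number.
  def numberWindows(purchases: DataFrame): DataFrame = {
    val byClient = Window.partitionBy(col("client_id")).orderBy(col("bucket.start"))
    purchases
      .withColumn("bucket", window(col("transaction_ts").cast("timestamp"), "8 days"))
      .withColumn("window_number", dense_rank().over(byClient))
      .drop("bucket")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[*]").appName("per-row-window-number").getOrCreate()
    numberWindows(samplePurchases(spark)).orderBy("client_id", "transaction_ts").show(false)
    spark.stop()
  }
}
```

`dense_rank()` is used instead of `row_number()` here because several rows share the same bucket: every row in a bucket should get the same number, and the numbers should have no gaps between consecutive non-empty buckets.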
Hope this helps!
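I did not use the proposed Spark solution; instead I did it with pure SQL logic and a cursor. It is not very efficient, but I needed to get the job done.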