How to define a WINDOWING function in a Spark SQL query to avoid repetitive code
I have a query with many lead and lag calls, so the partition by code is repeated for each of them.
If I use Scala code I can define the window spec and reuse it. Is there a way to reuse the partition code in Spark SQL?
The objective is to avoid repeating "over ( partition by sessionId, deviceId order by entry_datetime )".
SELECT * ,
lag( channel,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as prev_chnl,
lead( channel,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as next_chnl,
lag( `channel-source`,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as prev_chnl_source,
lead( `channel-source`,1,null ) over ( partition by sessionId, deviceId order by entry_datetime ) as next_chnl_source
FROM RAW_VIEW
RAW_VIEW
+------------+-----------+---------------------+---------+-----------------+
|sessionId |deviceId |entry_datetime |channel |channel-source |
+------------+-----------+---------------------+---------+-----------------+
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:00:00.0|001 |Internet |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 16:00:00.0|002 |Cable |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 17:00:00.0|003 |Satellite |
+------------+-----------+---------------------+---------+-----------------+
FINAL VIEW
+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+
|sessionId |deviceId |entry_datetime |channel |channel-source |prev_chnl|next_chnl|prev_chnl_source |next_chnl_source |
+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:00:00.0|001 |Internet |null |002 |null |Cable |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:01:00.0|002 |Cable |001 |003 |Internet |Satellite |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:02:00.0|003 |Satellite |002 |null |Cable |null |
+------------+-----------+---------------------+---------+-----------------+---------+---------+-----------------+-----------------+
You should be able to define a named window and reference it in the query:
SELECT * ,
lag(channel, 1) OVER w AS prev_chnl,
lead(channel, 1) OVER w AS next_chnl,
lag(`channel-source`, 1) OVER w AS prev_chnl_source,
lead(`channel-source`, 1) OVER w AS next_chnl_source
FROM raw_view
WINDOW w AS (PARTITION BY sessionId, deviceId ORDER BY entry_datetime)
but it looks like this functionality is currently broken.
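The WINDOW clause is standard SQL, so the shape of the named-window query can be sanity-checked outside Spark. The following is a minimal sketch, assuming Python's bundled sqlite3 module (with SQLite 3.28 or newer) as a stand-in engine and assuming the column is named channel_source rather than channel-source. It only shows that the syntax removes the repetition; it says nothing about whether a particular Spark version accepts the clause.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_view (sessionId TEXT, deviceId TEXT, "
    "entry_datetime TEXT, channel TEXT, channel_source TEXT)"
)
conn.executemany(
    "INSERT INTO raw_view VALUES (?, ?, ?, ?, ?)",
    [
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 15:00:00.0", "001", "Internet"),
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 16:00:00.0", "002", "Cable"),
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 17:00:00.0", "003", "Satellite"),
    ],
)

# The window is defined once as `w` and referenced by every lag/lead call.
named_window_sql = """
SELECT channel,
       lag(channel, 1)         OVER w AS prev_chnl,
       lead(channel, 1)        OVER w AS next_chnl,
       lag(channel_source, 1)  OVER w AS prev_chnl_source,
       lead(channel_source, 1) OVER w AS next_chnl_source
FROM raw_view
WINDOW w AS (PARTITION BY sessionId, deviceId ORDER BY entry_datetime)
ORDER BY entry_datetime
"""
named_window_result = conn.execute(named_window_sql).fetchall()
for row in named_window_result:
    print(row)
```

If the Spark version in use rejects the same WINDOW clause, the join-based workaround is the fallback.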
If you want to do this in spark-sql, one way is to add row_number() to your table over your ordered partitions. Then create a lag and a lead version of this table by adding 1 to the row_number (for the previous row) and subtracting 1 from it (for the next row). Finally, do a LEFT JOIN of the current table with the previous and next versions and select the appropriate columns.
For example, try the following:
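-- channel_source is the channel-source column, named with an underscore as in the output below.
SELECT curr.*,
       prev.channel AS prev_chnl,
       next.channel AS next_chnl,
       prev.channel_source AS prev_chnl_source,
       next.channel_source AS next_chnl_source
FROM (SELECT *,
             ROW_NUMBER() OVER (partition by sessionId,
                                             deviceId
                                order by entry_datetime) AS row_num
      FROM RAW_VIEW
     ) curr
-- Join on the partition keys as well as row_num, so that rows from
-- different sessions or devices are never paired with each other.
LEFT JOIN (SELECT *,
                  ROW_NUMBER() OVER (partition by sessionId,
                                                  deviceId
                                     order by entry_datetime) + 1 AS row_num
           FROM RAW_VIEW
          ) prev ON (curr.row_num = prev.row_num
                     AND curr.sessionId = prev.sessionId
                     AND curr.deviceId = prev.deviceId)
LEFT JOIN (SELECT *,
                  ROW_NUMBER() OVER (partition by sessionId,
                                                  deviceId
                                     order by entry_datetime) - 1 AS row_num
           FROM RAW_VIEW
          ) next ON (next.row_num = curr.row_num
                     AND next.sessionId = curr.sessionId
                     AND next.deviceId = curr.deviceId)
ORDER BY entry_datetime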
Which results in:
+------------+-----------+---------------------+-------+--------------+-------+---------+---------+----------------+----------------+
|sessionId |deviceId |entry_datetime |channel|channel_source|row_num|prev_chnl|next_chnl|prev_chnl_source|next_chnl_source|
+------------+-----------+---------------------+-------+--------------+-------+---------+---------+----------------+----------------+
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 15:00:00.0|001 |Internet |1 |null |002 |null |Cable |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 16:00:00.0|002 |Cable |2 |001 |003 |Internet |Satellite |
|SESSION-ID-1|DEVICE-ID-1|2018-04-09 17:00:00.0|003 |Satellite |3 |002 |null |Cable |null |
+------------+-----------+---------------------+-------+--------------+-------+---------+---------+----------------+----------------+
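The row_number-plus-LEFT-JOIN approach can be checked against lag and lead directly. The sketch below again assumes Python's sqlite3 module as a local stand-in for Spark, and adds a hypothetical second session (not part of the original data) so that row numbers repeat across partitions. It joins on sessionId and deviceId as well as on the row number, which is what keeps rows from different partitions from being paired. The alias nxt is used instead of next purely as a precaution.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_view (sessionId TEXT, deviceId TEXT, "
    "entry_datetime TEXT, channel TEXT)"
)
conn.executemany(
    "INSERT INTO raw_view VALUES (?, ?, ?, ?)",
    [
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 15:00:00.0", "001"),
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 16:00:00.0", "002"),
        ("SESSION-ID-1", "DEVICE-ID-1", "2018-04-09 17:00:00.0", "003"),
        # Hypothetical second session, added only for this check.
        ("SESSION-ID-2", "DEVICE-ID-2", "2018-04-09 15:30:00.0", "004"),
        ("SESSION-ID-2", "DEVICE-ID-2", "2018-04-09 16:30:00.0", "005"),
    ],
)

# Each row gets its position within its (sessionId, deviceId) partition.
numbered = """
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY sessionId, deviceId
                          ORDER BY entry_datetime) AS row_num
FROM raw_view
"""

# Emulate lag/lead by joining each row to its neighbours in the same partition.
join_sql = f"""
SELECT curr.sessionId, curr.channel,
       prev.channel AS prev_chnl,
       nxt.channel  AS next_chnl
FROM ({numbered}) curr
LEFT JOIN ({numbered}) prev
       ON prev.sessionId = curr.sessionId
      AND prev.deviceId  = curr.deviceId
      AND prev.row_num + 1 = curr.row_num
LEFT JOIN ({numbered}) nxt
       ON nxt.sessionId = curr.sessionId
      AND nxt.deviceId  = curr.deviceId
      AND nxt.row_num - 1 = curr.row_num
ORDER BY curr.sessionId, curr.entry_datetime
"""

# Reference query using lag/lead over the same partitioning and ordering.
window_sql = """
SELECT sessionId, channel,
       lag(channel, 1)  OVER w AS prev_chnl,
       lead(channel, 1) OVER w AS next_chnl
FROM raw_view
WINDOW w AS (PARTITION BY sessionId, deviceId ORDER BY entry_datetime)
ORDER BY sessionId, entry_datetime
"""

join_result = conn.execute(join_sql).fetchall()
window_result = conn.execute(window_sql).fetchall()
print(join_result == window_result)
```

Joining on the row number alone would match row 1 of one session with row 2 of every other session, so the partition keys belong in the join condition whenever the view holds more than one partition.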