Cumulative array aggregation in SparkSQL
I have the following dataset:
event_id | user_id | event | event_type | event_ts | item_id | next_event_type | next_item_id
---|---|---|---|---|---|---|---
246984 | 993922 | {"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5260 | 1000 | ITEM_PURCHASED | 1001
246984 | 993922 | {"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5855 | 1001 | ITEM_PURCHASED | 1002
I want to cumulatively append the next item_id to an array, as in the next_item_set column below. I know I could do this in a UDF, but the dataset is quite large and I want to avoid the performance hit.
event_id | user_id | event | event_type | event_ts | item_id | next_event_type | next_item_id | next_item_set
---|---|---|---|---|---|---|---|---
246984 | 993922 | {"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5260 | 1000 | ITEM_PURCHASED | 1001 | [1000, 1001]
246984 | 993922 | {"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5855 | 1001 | ITEM_PURCHASED | 1002 | [1000, 1001, 1002]
This is the query I have so far:
with a as (
select event_id
, user_id
, event
, event_type
, event_ts
, item_id
, lead(event_type) over (partition by user_id order by event_ts) as next_event_type
, lead(item_id) over (partition by user_id order by event_ts) as next_item_id
from tableA
)
select *
, concat(lag(next_item_set) over (order by event_ts), array(next_item_id)) as cumulative_item_set
from a
;
You could use collect_list or collect_set and specify the window's frame as running from unbounded preceding to 1 following. Try adding this to your select clause:
collect_list(item_id) over (partition by user_id order by event_ts rows between unbounded preceding and 1 following) as next_item_set
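To see why that frame produces the desired next_item_set, the window semantics can be simulated in plain Python over the question's sample rows. This is only a sketch of what the frame clause does, not Spark code; the third row (item_id 1002, which the question's next_item_id column implies exists) and its event_ts of 6000 are assumed for illustration.

```python
from itertools import groupby

# Sample rows from the question; the third row is assumed from next_item_id = 1002.
rows = [
    {"user_id": 993922, "event_ts": 5260, "item_id": 1000},
    {"user_id": 993922, "event_ts": 5855, "item_id": 1001},
    {"user_id": 993922, "event_ts": 6000, "item_id": 1002},  # assumed row
]

def next_item_set(rows):
    """Simulate collect_list(item_id) over (partition by user_id order by event_ts
    rows between unbounded preceding and 1 following)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r["user_id"], r["event_ts"]))
    # partition by user_id
    for _, part in groupby(ordered, key=lambda r: r["user_id"]):
        items = [r["item_id"] for r in part]  # ordered by event_ts within the partition
        for i in range(len(items)):
            # frame: partition start through one row after the current row
            out.append(items[: i + 2])
    return out

print(next_item_set(rows))
# [[1000, 1001], [1000, 1001, 1002], [1000, 1001, 1002]]
```

Note that on the last row of each partition the frame simply stops at the partition end, so the final array repeats rather than growing, which matches how Spark clamps the `1 following` bound.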