Cumulative array aggregation in SparkSQL

Question

I have the following dataset:

event_id	user_id	event	event_type	event_ts	item_id	next_event_type	next_item_id
246984	993922	{"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"}	ITEM_PURCHASED	5260	1000	ITEM_PURCHASED	1001
246984	993922	{"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"}	ITEM_PURCHASED	5855	1001	ITEM_PURCHASED	1002

I want to cumulatively append the next item_id to an array, as in the next_item_set column of the table that follows. I know I can do this in a UDF, but the dataset is massive and I want to avoid the performance hit.

event_id	user_id	event	event_type	event_ts	item_id	next_event_type	next_item_id	next_item_set
246984	993922	{"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"}	ITEM_PURCHASED	5260	1000	ITEM_PURCHASED	1001	[1000, 1001]
246984	993922	{"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"}	ITEM_PURCHASED	5855	1001	ITEM_PURCHASED	1002	[1000, 1001, 1002]

This is the query I have so far:

with a as (
select event_id
    , user_id
    , event
    , event_type
    , event_ts
    , item_id
    , lead(event_type) over (partition by user_id order by event_ts) as next_event_type
    , lead(item_id) over (partition by user_id order by event_ts) as next_item_id
from tableA
)
select *
, concat(lag(next_item_set) over (order by event_ts), array(next_item_id))  as cumulative_item_set
from a
;

Answer 1

You could use collect_list or collect_set as a window function and set the window's frame to run from unbounded preceding to 1 following, so each row collects every earlier item_id, its own, and the next one. Try adding this to your select clause:

collect_list(item_id) over (partition by user_id order by event_ts rows between unbounded preceding and 1 following) as next_item_set
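
For context, here is a sketch of how that expression might slot into the query from the question. It assumes the same tableA and column names as the question and has not been run against the real dataset. Since the window function builds the array directly, the lag/concat step from the original attempt is dropped.

```sql
-- Sketch only: tableA and its columns are taken from the question.
select event_id
    , user_id
    , event
    , event_type
    , event_ts
    , item_id
    , lead(event_type) over (partition by user_id order by event_ts) as next_event_type
    , lead(item_id) over (partition by user_id order by event_ts) as next_item_id
    -- frame = every earlier row, the current row, and one row ahead
    , collect_list(item_id) over (
        partition by user_id
        order by event_ts
        rows between unbounded preceding and 1 following
      ) as next_item_set
from tableA
;
```

A few things to keep in mind: the last row of each user_id partition has no following row, so its array simply ends at its own item_id. collect_list skips null item_id values and keeps duplicates, while collect_set removes duplicates but does not guarantee element order. If two rows of the same user share an event_ts, the row order inside the frame is not deterministic, so adding a tie-breaker column to the order by may be worthwhile.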
