Cumulative array aggregation in SparkSQL

I have the following dataset:

| event_id | user_id | event | event_type | event_ts | item_id | next_event_type | next_item_id |
|---|---|---|---|---|---|---|---|
| 246984 | 993922 | {"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5260 | 1000 | ITEM_PURCHASED | 1001 |
| 246984 | 993922 | {"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5855 | 1001 | ITEM_PURCHASED | 1002 |

I want to cumulatively append the next item_id to an array, row by row. I know I can do this in a UDF, but the dataset is quite massive and I want to avoid a performance hit. The expected output:

| event_id | user_id | event | event_type | event_ts | item_id | next_event_type | next_item_id | next_item_set |
|---|---|---|---|---|---|---|---|---|
| 246984 | 993922 | {"item_id":1000,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5260 | 1000 | ITEM_PURCHASED | 1001 | [1000, 1001] |
| 246984 | 993922 | {"item_id":1001,"user_id":993922,"timestamp":5260,"type":"ITEM_PURCHASED"} | ITEM_PURCHASED | 5855 | 1001 | ITEM_PURCHASED | 1002 | [1000, 1001, 1002] |

This is the query I have so far:

with a as (
select event_id
    , user_id
    , event
    , event_type
    , event_ts
    , item_id
    , lead(event_type) over (partition by user_id order by event_ts) as next_event_type
    , lead(item_id) over (partition by user_id order by event_ts) as next_item_id
from tableA
)
select *
, concat(lag(next_item_set) over (order by event_ts), array(next_item_id))  as cumulative_item_set
from a
;

You could use collect_list (or collect_set, if duplicates should be dropped) and specify the window's frame from unbounded preceding to 1 following. Try adding this to your select clause:

collect_list(item_id) over (partition by user_id order by event_ts rows between unbounded preceding and 1 following) as next_item_set
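To see why that frame gives the arrays in the expected output, here is a minimal plain-Python sketch of what the window expression computes. It is an emulation for illustration only, not Spark code: the function name `cumulative_next_item_set` is made up, the rows are trimmed to the three columns the window uses, and the third row (timestamp 6100, item 1002) is a hypothetical addition, since the question implies a following purchase but does not show it.

```python
from collections import defaultdict


def cumulative_next_item_set(rows):
    """Mimic collect_list(item_id) OVER (PARTITION BY user_id ORDER BY event_ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) in plain Python."""
    # PARTITION BY user_id
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["user_id"]].append(row)

    result = []
    for user_rows in partitions.values():
        # ORDER BY event_ts
        user_rows.sort(key=lambda r: r["event_ts"])
        items = [r["item_id"] for r in user_rows]
        for i, row in enumerate(user_rows):
            # Frame: first row of the partition through the row after the
            # current one. On the last row there is no following row, so the
            # slice simply stops at the end of the partition.
            frame = items[: i + 2]
            result.append({**row, "next_item_set": frame})
    return result


rows = [
    {"user_id": 993922, "event_ts": 5260, "item_id": 1000},
    {"user_id": 993922, "event_ts": 5855, "item_id": 1001},
    {"user_id": 993922, "event_ts": 6100, "item_id": 1002},  # hypothetical row
]

for r in cumulative_next_item_set(rows):
    print(r["event_ts"], r["next_item_set"])
# 5260 [1000, 1001]
# 5855 [1000, 1001, 1002]
# 6100 [1000, 1001, 1002]
```

The first two lines match the next_item_set column the question asks for. The last row of each partition repeats the full list, because its frame has no following row to add.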
