How to aggregate values across different columns in PySpark (or eventually SQL)?
Let's consider the following input data:
| incremental_id | session_start_id | session_end_id | items_bought |
|----------------|------------------|----------------|--------------|
| 1 | a | b | 1 |
| 2 | z | t | 7 |
| 3 | b | c | 0 |
| 4 | c | d | 3 |
Where matching `session_end_id = session_start_id` values link two rows into the same user's chain of sessions: rows 1, 3 and 4 (a → b → c → d) belong to one user, while row 2 (z → t) belongs to a second user. I would like to aggregate the data above so that I obtain:

| session_start_id | items_bought |
|------------------|--------------|
| a | 4 |
| z | 7 |
How can this be done in PySpark (or eventually in plain SQL)? I would like to avoid UDFs in PySpark, but if that is the only way, fine.

Thanks for your help!
Edit: I have updated the example dataframe; incremental_id alone cannot be used to order the rows into consecutive sessions.
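To make the expected result concrete, here is a minimal plain-Python sketch (not PySpark, just an illustration of the chaining rule): a chain "head" is a row whose session_start_id never appears as a session_end_id, and each chain is followed by matching the current end id to the next row's start id. The tuple layout simply mirrors the table above.

```python
rows = [
    # (incremental_id, session_start_id, session_end_id, items_bought)
    (1, "a", "b", 1),
    (2, "z", "t", 7),
    (3, "b", "c", 0),
    (4, "c", "d", 3),
]

# A chain head is a row whose start id is not any row's end id.
ends = {end for _, _, end, _ in rows}
by_start = {start: (end, bought) for _, start, end, bought in rows}

totals = {}
for _, start, end, bought in rows:
    if start in ends:
        continue  # not a chain head, will be reached by following a chain
    head, total = start, bought
    # Follow the chain: the current end id is the next row's start id.
    while end in by_start:
        end, bought = by_start[end]
        total += bought
    totals[head] = total

print(totals)  # {'a': 4, 'z': 7}
```

This matches the desired output: user a bought 1 + 0 + 3 = 4 items across the a → b → c → d chain, and user z bought 7.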
Common Table Expressions are part of SQL:1999. Using a recursive CTE, we can write the following query:
```sql
WITH cte(session_start_id, session_end_id, items_bought) AS (
    SELECT session_start_id, session_end_id, items_bought
    FROM user_session
    WHERE session_start_id NOT IN (SELECT session_end_id FROM user_session)
    UNION ALL
    SELECT a.session_start_id, b.session_end_id, b.items_bought
    FROM cte a
    INNER JOIN user_session b ON a.session_end_id = b.session_start_id
)
SELECT session_start_id, SUM(items_bought)
FROM cte
GROUP BY session_start_id
```
Explanation: the anchor member selects the chain heads, i.e. rows whose session_start_id never appears as a session_end_id; the recursive member repeatedly joins the chain's current session_end_id to the next row's session_start_id, carrying the head's session_start_id along; the final SELECT groups by that head id and sums items_bought.
SQL Fiddle link: http://sqlfiddle.com/#!4/ac98a/4/0
(Note: the fiddle uses Oracle, but any database engine that supports recursive CTEs should work.)
Here is the PySpark version:
```python
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Create a window over the full data so we can lag the session end id
# (no partitionBy keys: all rows are pulled into a single partition)
win = Window().partitionBy().orderBy("incremental_id")

# Flag a user change: the previous row's end id does not match this row's start id
df = df.withColumn('user_boundary', F.lag(F.col("session_end_id"), 1).over(win) != F.col("session_start_id"))
df = df.withColumn('user_boundary', F.when(F.col("user_boundary").isNull(), F.lit(False)).otherwise(F.col("user_boundary")))

# Now create an artificial user id as a running sum of the boundary flag
df = df.withColumn('user_id', F.sum(F.col("user_boundary").cast(IntegerType())).over(win))

# Aggregate
df.groupby('user_id').agg(F.sum(F.col("items_bought")).alias("total_bought")).show()
```
```
+-------+------------+
|user_id|total_bought|
+-------+------------+
|      0|           4|
|      1|           7|
+-------+------------+
```
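The lag-and-cumulative-sum trick above can be sketched in plain Python, which may help when checking the window logic: set a boundary flag whenever the previous row's session_end_id differs from the current row's session_start_id, and take a running sum of that flag as the artificial user id. This sketch assumes the rows arrive already ordered so that each user's sessions are consecutive, which is exactly what the window's orderBy assumes (and which, per the question's edit, incremental_id alone does not guarantee).

```python
# Rows assumed pre-ordered so that each user's sessions are consecutive:
# (session_start_id, session_end_id, items_bought)
rows = [
    ("a", "b", 1),
    ("b", "c", 0),
    ("c", "d", 3),
    ("z", "t", 7),
]

totals = {}
user_id = 0
prev_end = None
for start, end, bought in rows:
    # F.lag equivalent: compare this row's start with the previous row's end.
    if prev_end is not None and prev_end != start:
        user_id += 1  # running sum of the boundary flag
    totals[user_id] = totals.get(user_id, 0) + bought
    prev_end = end

print(totals)  # {0: 4, 1: 7}
```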
If you have access to temporary-table creation and affected-row-count metadata (e.g. @@ROWCOUNT in SQL Server), you can port it like this:
```sql
insert into #CTESubs
select
    session_start_id,
    session_end_id,
    items_bought
from #user_session
WHERE
    session_start_id not in (select session_end_id from #user_session)

while(@@ROWCOUNT <> 0)
begin
    insert into #CTESubs
    select distinct
        p.session_start_id,
        c.session_end_id,
        c.items_bought
    from #user_session c
    inner join #CTESubs p on c.session_start_id = p.session_end_id
    WHERE
        p.session_start_id not in (select session_end_id from #user_session)
        and c.session_end_id not in (select session_end_id from #CTESubs)
end

select
    session_start_id,
    sum(items_bought) items_bought
from #CTESubs
group by
    session_start_id;
```
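The iterative expansion driven by while(@@ROWCOUNT <> 0) above can be mimicked in plain Python: seed a working set with the chain heads, then repeatedly join the working set against the source rows until a pass adds nothing new. This is a rough sketch only; tuples stand in for the temp-table rows, and all names mirror the T-SQL.

```python
# (session_start_id, session_end_id, items_bought), as in #user_session
user_session = [
    ("a", "b", 1),
    ("z", "t", 7),
    ("b", "c", 0),
    ("c", "d", 3),
]

all_ends = {e for _, e, _ in user_session}

# Seed (#CTESubs): rows whose start id never appears as an end id.
subs = [(s, e, b) for s, e, b in user_session if s not in all_ends]

# Loop until a pass inserts no rows, mirroring while(@@ROWCOUNT <> 0).
added = True
while added:
    added = False
    seen_ends = {e for _, e, _ in subs}
    for head, end, _ in list(subs):
        for s, e, b in user_session:
            # Join: chain's current end id matches the next row's start id,
            # skipping end ids already present in the working set.
            if s == end and e not in seen_ends:
                subs.append((head, e, b))
                seen_ends.add(e)
                added = True

# Final aggregation, as in the closing SELECT ... GROUP BY.
totals = {}
for head, _, bought in subs:
    totals[head] = totals.get(head, 0) + bought

print(totals)  # {'a': 4, 'z': 7}
```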