In Spark, how to use a window spec with aggregate functions
I have a Spark data frame that looks like this:
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
|parent_key|id |value                                                      |raw_is_active           |updated_at         |
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
|1         |2  |[, 0, USER, 2020-12-11 04:50:40, 2020-12-11 04:50:40,]     |[2020-12-11 04:50:40, 0]|2020-12-11 04:50:40|
|1         |2  |[testA, 0, USER, 2020-12-11 04:50:40, 2020-12-11 17:18:00,]|null                    |2020-12-11 17:18:00|
|1         |2  |[testA, 0, USER, 2020-12-11 04:50:40, 2020-12-11 17:19:56,]|null                    |2020-12-11 17:19:56|
|1         |2  |[testA, 1, USER, 2020-12-11 04:50:40, 2020-12-11 17:20:24,]|[2020-12-11 17:20:24, 1]|2020-12-11 17:20:24|
|2         |3  |[testB, 0, USER, 2020-12-11 17:24:03, 2020-12-11 17:24:03,]|[2020-12-11 17:24:03, 0]|2020-12-11 17:24:03|
|3         |4  |[testC, 0, USER, 2020-12-11 17:27:36, 2020-12-11 17:27:36,]|[2020-12-11 17:27:36, 0]|2020-12-11 17:27:36|
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
The schema is:
root
|-- parent_key: long (nullable = true)
|-- id: string (nullable = true)
|-- value: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- is_active: integer (nullable = true)
| |-- source: string (nullable = true)
| |-- created_at: timestamp (nullable = true)
| |-- updated_at: timestamp (nullable = true)
|-- raw_is_active: struct (nullable = true)
| |-- updated_at: timestamp (nullable = true)
| |-- value: integer (nullable = true)
|-- updated_at: timestamp (nullable = true)
I am looking for this output:
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
|parent_key|id |value                                                     |raw_is_active                                       |updated_at         |
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
|1         |2  |[testA, 1, USER, 2020-12-11 04:50:40, 2020-12-11 17:20:24]|[[2020-12-11 04:50:40, 0], [2020-12-11 17:20:24, 1]]|2020-12-11 04:50:40|
|2         |3  |[testB, 0, USER, 2020-12-11 17:24:03, 2020-12-11 17:24:03]|[2020-12-11 17:24:03, 0]                            |2020-12-11 17:24:03|
|3         |4  |[testC, 0, USER, 2020-12-11 17:27:36, 2020-12-11 17:27:36]|[2020-12-11 17:27:36, 0]                            |2020-12-11 17:27:36|
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
So, on the basis of the updated_at column, I want to keep the latest row, and I also want to build an array of the raw_is_active values across all rows for a given id.
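For reference, a minimal frame with the same shape can be built like this (a sketch; the case class name, the local session settings, and the reduced set of rows and columns are assumptions, not part of the original data):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Hypothetical struct mirroring the raw_is_active field of the schema above
case class RawActive(updated_at: Timestamp, value: Int)

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

def ts(s: String): Timestamp = Timestamp.valueOf(s)

// Reduced reproduction: only the columns the question actually exercises
val dataFrame = Seq(
  (1L, "2", RawActive(ts("2020-12-11 04:50:40"), 0), ts("2020-12-11 04:50:40")),
  (1L, "2", null.asInstanceOf[RawActive], ts("2020-12-11 17:18:00")),
  (1L, "2", RawActive(ts("2020-12-11 17:20:24"), 1), ts("2020-12-11 17:20:24")),
  (2L, "3", RawActive(ts("2020-12-11 17:24:03"), 0), ts("2020-12-11 17:24:03"))
).toDF("parent_key", "id", "raw_is_active", "updated_at")
```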
I know I can pick the latest value using this code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
dataFrame
.withColumn("maxTS", first("updated_at").over(windowSpec))
.where(col("maxTS") === col("updated_at"))
.drop("maxTS")
But I am not sure how to also build a set from the raw_is_active column.
Alternatively, I could use groupBy entirely, like:
dataFrame
.groupBy("parent_key", "id")
.agg(collect_list("value") as "value_list", collect_set("raw_is_active") as "active_list")
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
.drop("value_list")
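As a side note, collect_list carries no ordering guarantee across a shuffle, so indexing the last element is not a reliable way to get the latest row. One alternative sketch (assuming the dataFrame above): aggregate max over a struct whose first field is the timestamp, so the latest value wins deterministically:

```scala
import org.apache.spark.sql.functions._

dataFrame
  .groupBy("parent_key", "id")
  .agg(
    // Structs compare field by field, left to right, so leading with the
    // timestamp makes max() select the struct holding the latest value
    max(struct(col("updated_at") as "ts", col("value") as "v")) as "latest",
    collect_set("raw_is_active") as "active_list",
    min("updated_at") as "updated_at"
  )
  .withColumn("value", col("latest.v"))
  .drop("latest")
```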
For the above, I am not sure that
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
will always give me the latest value. Also, given collect_list and collect_set, is this code efficient?
UPDATE
Thanks to @mck, I was able to get it working with this code:
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
val windowSpecSet = Window.partitionBy("id").orderBy(col("updated_at"))
val df2 = dataFrame.withColumn(
"rn",
row_number().over(windowSpec)
).withColumn(
"active_list",
collect_set("raw_is_active").over(windowSpecSet)
).drop("raw_is_active").filter("rn = 1")
However, this code is taking more time than my existing code:
dataFrame
.groupBy("parent_key", "id")
.agg(collect_list("value") as "value_list", collect_set("raw_is_active") as "active_list")
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
.drop("value_list")
I was under the impression that a window function would perform better than groupBy with agg.
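That impression does not hold in general: groupBy/agg can pre-aggregate rows on each input partition before the shuffle, while a window function must shuffle and sort every input row first and only then filter. Comparing the physical plans makes the difference visible (a sketch against the dataFrame from the question; the plan shapes described in the comments are what Spark typically produces, not guaranteed output):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// groupBy: expect a partial aggregate -> Exchange -> final aggregate,
// i.e. rows are reduced per partition before crossing the network
dataFrame
  .groupBy("parent_key", "id")
  .agg(collect_set("raw_is_active") as "active_list")
  .explain()

// window: expect Exchange -> Sort -> Window -> Filter, i.e. every input
// row is shuffled and sorted before row_number is even computed
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
dataFrame
  .withColumn("rn", row_number().over(windowSpec))
  .filter("rn = 1")
  .explain()
```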
Assign a row_number to each row in each id partition and keep the rows with row_number = 1:
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
// collect_set needs the whole partition as its frame; with the default
// frame it would only see rows up to the current one in the ordering
val fullPartition = windowSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val df2 = dataFrame.withColumn(
"rn",
row_number().over(windowSpec)
).withColumn(
"active_list",
array_sort(collect_set("raw_is_active").over(fullPartition))
).drop("raw_is_active").filter("rn = 1")