In Spark, how to use a window spec with aggregate functions
I have a Spark data frame that looks like this:
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
|parent_key|id |value                                                      |raw_is_active           |updated_at         |
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
|1         |2  |[, 0, USER, 2020-12-11 04:50:40, 2020-12-11 04:50:40,]     |[2020-12-11 04:50:40, 0]|2020-12-11 04:50:40|
|1         |2  |[testA, 0, USER, 2020-12-11 04:50:40, 2020-12-11 17:18:00,]|null                    |2020-12-11 17:18:00|
|1         |2  |[testA, 0, USER, 2020-12-11 04:50:40, 2020-12-11 17:19:56,]|null                    |2020-12-11 17:19:56|
|1         |2  |[testA, 1, USER, 2020-12-11 04:50:40, 2020-12-11 17:20:24,]|[2020-12-11 17:20:24, 1]|2020-12-11 17:20:24|
|2         |3  |[testB, 0, USER, 2020-12-11 17:24:03, 2020-12-11 17:24:03,]|[2020-12-11 17:24:03, 0]|2020-12-11 17:24:03|
|3         |4  |[testC, 0, USER, 2020-12-11 17:27:36, 2020-12-11 17:27:36,]|[2020-12-11 17:27:36, 0]|2020-12-11 17:27:36|
+----------+---+-----------------------------------------------------------+------------------------+-------------------+
The schema is:
root
|-- parent_key: long (nullable = true)
|-- id: string (nullable = true)
|-- value: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- is_active: integer (nullable = true)
| |-- source: string (nullable = true)
| |-- created_at: timestamp (nullable = true)
| |-- updated_at: timestamp (nullable = true)
|-- raw_is_active: struct (nullable = true)
| |-- updated_at: timestamp (nullable = true)
| |-- value: integer (nullable = true)
|-- updated_at: timestamp (nullable = true)
I am looking for this output:
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
|parent_key|id |value                                                     |raw_is_active                                       |updated_at         |
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
|1         |2  |[testA, 1, USER, 2020-12-11 04:50:40, 2020-12-11 17:20:24]|[[2020-12-11 04:50:40, 0], [2020-12-11 17:20:24, 1]]|2020-12-11 04:50:40|
|2         |3  |[testB, 0, USER, 2020-12-11 17:24:03, 2020-12-11 17:24:03]|[2020-12-11 17:24:03, 0]                            |2020-12-11 17:24:03|
|3         |4  |[testC, 0, USER, 2020-12-11 17:27:36, 2020-12-11 17:27:36]|[2020-12-11 17:27:36, 0]                            |2020-12-11 17:27:36|
+----------+---+----------------------------------------------------------+----------------------------------------------------+-------------------+
So, on the basis of the updated_at column, I want to keep the latest row, and I also want to build an array of the raw_is_active values across all rows for a given id.
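For reference, a minimal frame with the same shape can be built like this (a sketch; the case class name, the local session settings, and the reduced set of rows and columns are assumptions, not part of the original data):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Hypothetical struct mirroring the raw_is_active field of the schema above
case class RawActive(updated_at: Timestamp, value: Int)

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

def ts(s: String): Timestamp = Timestamp.valueOf(s)

// Reduced reproduction: only the columns the question actually exercises
val dataFrame = Seq(
  (1L, "2", RawActive(ts("2020-12-11 04:50:40"), 0), ts("2020-12-11 04:50:40")),
  (1L, "2", null.asInstanceOf[RawActive], ts("2020-12-11 17:18:00")),
  (1L, "2", RawActive(ts("2020-12-11 17:20:24"), 1), ts("2020-12-11 17:20:24")),
  (2L, "3", RawActive(ts("2020-12-11 17:24:03"), 0), ts("2020-12-11 17:24:03"))
).toDF("parent_key", "id", "raw_is_active", "updated_at")
```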
I know I can pick the latest value using this code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
dataFrame
.withColumn("maxTS", first("updated_at").over(windowSpec))
.where(col("maxTS") === col("updated_at"))
.drop("maxTS")
But I am not sure how to also build a set from the raw_is_active column.
Alternatively, I could use groupBy entirely, like:
dataFrame
.groupBy("parent_key", "id")
.agg(collect_list("value") as "value_list", collect_set("raw_is_active") as "active_list")
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
.drop("value_list")
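As a side note, collect_list carries no ordering guarantee across a shuffle, so indexing the last element is not a reliable way to get the latest row. One alternative sketch (assuming the dataFrame above): aggregate max over a struct whose first field is the timestamp, so the latest value wins deterministically:

```scala
import org.apache.spark.sql.functions._

dataFrame
  .groupBy("parent_key", "id")
  .agg(
    // Structs compare field by field, left to right, so leading with the
    // timestamp makes max() select the struct holding the latest value
    max(struct(col("updated_at") as "ts", col("value") as "v")) as "latest",
    collect_set("raw_is_active") as "active_list",
    min("updated_at") as "updated_at"
  )
  .withColumn("value", col("latest.v"))
  .drop("latest")
```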
For the above, I am not sure that
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
will always give me the latest value. Also, given collect_list and collect_set, is this code efficient?
UPDATE
Thanks to @mck, I was able to get it working with this code:
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
val windowSpecSet = Window.partitionBy("id").orderBy(col("updated_at"))
val df2 = dataFrame.withColumn(
"rn",
row_number().over(windowSpec)
).withColumn(
"active_list",
collect_set("raw_is_active").over(windowSpecSet)
).drop("raw_is_active").filter("rn = 1")
However, this code is taking more time than my existing code:
dataFrame
.groupBy("parent_key", "id")
.agg(collect_list("value") as "value_list", collect_set("raw_is_active") as "active_list")
.withColumn("value", col("value_list")(size(col("value_list")).minus(1)))
.drop("value_list")
I was under the impression that a window function would perform better than groupBy with agg.
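That impression does not hold in general: groupBy/agg can pre-aggregate rows on each input partition before the shuffle, while a window function must shuffle and sort every input row first and only then filter. Comparing the physical plans makes the difference visible (a sketch against the dataFrame from the question; the plan shapes described in the comments are what Spark typically produces, not guaranteed output):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// groupBy: expect a partial aggregate -> Exchange -> final aggregate,
// i.e. rows are reduced per partition before crossing the network
dataFrame
  .groupBy("parent_key", "id")
  .agg(collect_set("raw_is_active") as "active_list")
  .explain()

// window: expect Exchange -> Sort -> Window -> Filter, i.e. every input
// row is shuffled and sorted before row_number is even computed
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
dataFrame
  .withColumn("rn", row_number().over(windowSpec))
  .filter("rn = 1")
  .explain()
```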
Assign a row_number to each row in each id partition and keep the rows with row_number = 1:
val windowSpec = Window.partitionBy("id").orderBy(col("updated_at").desc)
// collect_set needs the whole partition as its frame; with the default
// frame it would only see rows up to the current one in the ordering
val fullPartition = windowSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val df2 = dataFrame.withColumn(
"rn",
row_number().over(windowSpec)
).withColumn(
"active_list",
array_sort(collect_set("raw_is_active").over(fullPartition))
).drop("raw_is_active").filter("rn = 1")