Spark: GroupBy and collect_list while filtering by another column
I have the following dataframe:
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" values, while filtering on the value in the "active" column.
The expected result would be:
+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a | [1,2] | [1] | [2] |
|b | [1] | [1] | [] |
+-----+---------+---------+----------+
I can easily filter for the "y" value with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value:
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't understand whether it's possible to aggregate and filter in a single pass, nor how to get the difference of the two sets.
You can do a pivot and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label"))
  .withColumn(
    "difference",
    array_union(
      array_except(col("n"), col("y")),
      array_except(col("y"), col("n"))
    )
  )
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
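For reference, the "difference" column above is the symmetric difference of the two label arrays (elements in either but not both). The same logic can be sketched in plain Scala, outside Spark; the helper name symmetricDiff is illustrative, not part of any API:

```scala
// Mirrors array_union(array_except(n, y), array_except(y, n)):
// elements present in exactly one of the two sets.
def symmetricDiff(a: Set[Int], b: Set[Int]): Set[Int] =
  (a diff b) union (b diff a)

// Group a: active-n labels {1}, active-y labels {1, 2}
val groupA = symmetricDiff(Set(1), Set(1, 2)) // Set(2)
// Group b: both sides are {1}
val groupB = symmetricDiff(Set(1), Set(1))    // Set()
```

Note that array_except keeps only distinct elements, so the pivot result behaves like a set operation even though collect_list may produce duplicates.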
Thanks @mck for the help. I have found an alternative way to solve the question, namely to filter with when during the aggregation:
df
.groupBy("group")
.agg(
collect_set(when($"active" === "y", $"label")).as("labelyes"),
collect_set(when($"active" === "n", $"label")).as("labelno")
)
.withColumn("diff", array_except($"labelyes", $"labelno"))