Spark: GroupBy and collect_list while filtering by another column
I have the following dataframe:
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" values, while filtering on the value in the "active" column.
The expected result would be:
+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a | [1,2] | [1] | [2] |
|b | [1] | [1] | [] |
+-----+---------+---------+----------+
I can easily filter for the "y" value with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value:
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't understand whether it's possible to aggregate and filter in a single pass, nor how to get the difference of the two sets.
You can do a pivot and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label"))
  .withColumn(
    "difference",
    array_union(
      array_except(col("n"), col("y")),
      array_except(col("y"), col("n"))
    )
  )
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
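For reference, the "difference" column above is the symmetric difference of the two label arrays (elements in either but not both). The same logic can be sketched in plain Scala, outside Spark; the helper name symmetricDiff is illustrative, not part of any API:

```scala
// Mirrors array_union(array_except(n, y), array_except(y, n)):
// elements present in exactly one of the two sets.
def symmetricDiff(a: Set[Int], b: Set[Int]): Set[Int] =
  (a diff b) union (b diff a)

// Group a: active-n labels {1}, active-y labels {1, 2}
val groupA = symmetricDiff(Set(1), Set(1, 2)) // Set(2)
// Group b: both sides are {1}
val groupB = symmetricDiff(Set(1), Set(1))    // Set()
```

Note that array_except keeps only distinct elements, so the pivot result behaves like a set operation even though collect_list may produce duplicates.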
Thanks @mck for the help. I have found an alternative way to solve the question, namely to filter with when during the aggregation:
df
.groupBy("group")
.agg(
collect_set(when($"active" === "y", $"label")).as("labelyes"),
collect_set(when($"active" === "n", $"label")).as("labelno")
)
.withColumn("diff", array_except($"labelyes", $"labelno"))