
Spark: GroupBy and collect_list while filtering by another column

I have the following dataframe:

+-----+-----+------+
|group|label|active|
+-----+-----+------+
|    a|    1|     y|
|    a|    2|     y|
|    a|    1|     n|
|    b|    1|     y|
|    b|    1|     n|
+-----+-----+------+

I would like to group by the "group" column and collect the "label" values, while filtering on the value in the "active" column.

The expected result would be:

+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a    | [1,2]   | [1]     | [2]      |
|b    | [1]     | [1]     | []       |
+-----+---------+---------+----------+

I can easily filter for the "y" value with

val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))

and similarly for the "n" value:

val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))

but I don't see whether it's possible to aggregate both at once while filtering, nor how to compute the difference of the two resulting sets.

You can do a pivot, and use some array functions to get the difference:

import org.apache.spark.sql.functions._

val df2 = df.groupBy("group")
    .pivot("active")               // one column per "active" value: n, y
    .agg(collect_list("label"))
    .withColumn(
        "difference",
        // symmetric difference: elements present in exactly one of the two arrays
        array_union(
            array_except(col("n"), col("y")),
            array_except(col("y"), col("n"))
        )
    )

df2.show
+-----+---+------+----------+
|group|  n|     y|difference|
+-----+---+------+----------+
|    b|[1]|   [1]|        []|
|    a|[1]|[1, 2]|       [2]|
+-----+---+------+----------+
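The "difference" column above is the symmetric difference of the two collected arrays. A minimal plain-Scala sketch of the same logic (no Spark required; the function name is illustrative):

```scala
// Symmetric difference of two label lists, mirroring
// array_union(array_except(n, y), array_except(y, n)).
def symmetricDifference(a: Seq[Int], b: Seq[Int]): Seq[Int] =
  (a.diff(b) ++ b.diff(a)).distinct

val yes = Seq(1, 2) // labels with active = "y" for group a
val no  = Seq(1)    // labels with active = "n" for group a

println(symmetricDifference(yes, no)) // List(2)
```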

Thanks @mck for the help. I have found an alternative way to solve the question, namely filtering with when during the aggregation:

df
    .groupBy("group")
    .agg(
        // when(...) without otherwise() yields null for non-matching rows,
        // and collect_set silently drops nulls, so each set keeps only its labels
        collect_set(when($"active" === "y", $"label")).as("labelyes"),
        collect_set(when($"active" === "n", $"label")).as("labelno")
    )
    // note: array_except is one-directional (labels in "labelyes" but not in "labelno")
    .withColumn("diff", array_except($"labelyes", $"labelno"))
