Spark DataFrame aggregate and groupby multiple columns while retaining order
I have the following data:
id | value1 | value2
---+--------+-------
 1 | A      | red
 1 | B      | red
 1 | C      | blue
 2 | A      | blue
 2 | B      | blue
 2 | C      | green
The result I need is:
id | values
---+------------------------------
 1 | [[A,red],[B,red],[C,blue]]
 2 | [[A,blue],[B,blue],[C,green]]
My approach so far is to groupBy and aggregate value1 and value2 into separate arrays, and then merge them together as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field:
df.groupBy(["id"]).agg(*[F.collect_list("value1"), F.collect_list("value2")])
However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are both matched to the correct values? The two calls could produce lists in different orders, and the subsequent merge would then pair up the wrong values.
As commented by @Raphael, you can first combine the value1 and value2 columns into a single struct-type column, and then collect_list on it, so the two values always travel together:
import pyspark.sql.functions as F

(df.withColumn('values', F.struct(df.value1, df.value2))  # pair value1/value2 row-wise
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()
+---+--------------------+
| id| values|
+---+--------------------+
| 1|[[A,red], [B,red]...|
| 2|[[A,blue], [B,blu...|
+---+--------------------+