Spark DataFrame aggregate and groupby multiple columns while retaining order
I have the following data:
id | value1 | value2
---+--------+-------
 1 | A      | red
 1 | B      | red
 1 | C      | blue
 2 | A      | blue
 2 | B      | blue
 2 | C      | green
The result I need is:
id | values
---+------------------------------
 1 | [[A,red],[B,red],[C,blue]]
 2 | [[A,blue],[B,blue],[C,green]]
My approach so far is to groupBy and aggregate value1 and value2 into separate arrays, and then merge them together as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field:
df.groupBy(["id"]).agg(*[F.collect_list("value1"), F.collect_list("value2")])
However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are both matched to the correct values? The two calls could produce lists in different orders, and the subsequent merge would then pair up the wrong values.
As commented by @Raphael, you can first combine the value1 and value2 columns into a single struct-type column, and then collect_list on it, so the two values always travel together:
import pyspark.sql.functions as F

(df.withColumn('values', F.struct(df.value1, df.value2))  # pair value1/value2 row-wise
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()
+---+--------------------+
| id| values|
+---+--------------------+
| 1|[[A,red], [B,red]...|
| 2|[[A,blue], [B,blu...|
+---+--------------------+