
Spark DataFrame aggregate and groupby multiple columns while retaining order

I have the following data:

id | value1 | value2
--------------------
 1 |   A    |  red
 1 |   B    |  red
 1 |   C    |  blue
 2 |   A    |  blue
 2 |   B    |  blue
 2 |   C    |  green

The result I need is:

id |                         values
-----------------------------------
 1 |  [[A,red],[B,red],[C,blue]]
 2 |  [[A,blue],[B,blue],[C,green]]

My approach so far is to group by id and aggregate value1 and value2 into separate arrays, then merge them together as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field:

df.groupBy("id").agg(F.collect_list("value1"), F.collect_list("value2"))
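For reference, the merge step could be done element-wise with F.arrays_zip; this is a minimal sketch (assuming Spark 2.4+ for arrays_zip, with illustrative aliases v1/v2, and not necessarily the exact method from the linked post):

import pyspark.sql.functions as F

# Collect each column into its own array, then zip the arrays element-wise.
# The pairing is only correct if both lists happen to line up, since
# collect_list makes no ordering guarantee (aliases v1/v2 are illustrative).
merged = (df.groupBy("id")
            .agg(F.collect_list("value1").alias("v1"),
                 F.collect_list("value2").alias("v2"))
            .withColumn("values", F.arrays_zip("v1", "v2"))
            .drop("v1", "v2"))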

However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are matched to the correct values?

The two lists could end up in different orders, and the subsequent merge would then pair the wrong values together.

As commented by @Raphael, you can first combine the value1 and value2 columns into a single struct-type column, and then collect_list:

import pyspark.sql.functions as F

(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()

+---+--------------------+
| id|              values|
+---+--------------------+
|  1|[[A,red], [B,red]...|
|  2|[[A,blue], [B,blu...|
+---+--------------------+
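Note that show() truncates long cells by default; pass truncate=False to see the full arrays. If you additionally want a deterministic order of the pairs inside each list (collect_list alone gives no ordering guarantee, though the struct keeps each pair intact), one option is to wrap the result in sort_array, which orders an array of structs by their fields. A sketch, assuming that sorting by value1 then value2 is acceptable:

import pyspark.sql.functions as F

# Same struct trick, plus sort_array for a deterministic element order.
# sort_array orders the structs by their fields (value1, then value2) --
# an assumption that this ordering is what you want.
(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.sort_array(F.collect_list('values')).alias('values'))
   .show(truncate=False))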
