
Spark DataFrame aggregate and groupby multiple columns while retaining order

I have the following data:

id | value1 | value2
--------------------
 1 |   A    |  red
 1 |   B    |  red
 1 |   C    |  blue
 2 |   A    |  blue
 2 |   B    |  blue
 2 |   C    |  green

The result I need is:

id |                         values
-----------------------------------
 1 |  [[A,red],[B,red],[C,blue]]
 2 |  [[A,blue],[B,blue],[C,green]]

My approach so far is to group by id and aggregate value1 and value2 into separate arrays, then merge them together as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field:

df.groupBy("id").agg(F.collect_list("value1"), F.collect_list("value2"))
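For reference, the merge step could be done element-wise with F.arrays_zip; this is a minimal sketch (assuming Spark 2.4+ for arrays_zip, with illustrative aliases v1/v2, and not necessarily the exact method from the linked post):

import pyspark.sql.functions as F

# Collect each column into its own array, then zip the arrays element-wise.
# The pairing is only correct if both lists happen to line up, since
# collect_list makes no ordering guarantee (aliases v1/v2 are illustrative).
merged = (df.groupBy("id")
            .agg(F.collect_list("value1").alias("v1"),
                 F.collect_list("value2").alias("v2"))
            .withColumn("values", F.arrays_zip("v1", "v2"))
            .drop("v1", "v2"))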

However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are matched to the correct values?

The two lists could end up in different orders, and the subsequent merge would then pair the wrong values together.

As commented by @Raphael, you can first combine the value1 and value2 columns into a single struct-type column, and then collect_list:

import pyspark.sql.functions as F

(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()

+---+--------------------+
| id|              values|
+---+--------------------+
|  1|[[A,red], [B,red]...|
|  2|[[A,blue], [B,blu...|
+---+--------------------+
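Note that show() truncates long cells by default; pass truncate=False to see the full arrays. If you additionally want a deterministic order of the pairs inside each list (collect_list alone gives no ordering guarantee, though the struct keeps each pair intact), one option is to wrap the result in sort_array, which orders an array of structs by their fields. A sketch, assuming that sorting by value1 then value2 is acceptable:

import pyspark.sql.functions as F

# Same struct trick, plus sort_array for a deterministic element order.
# sort_array orders the structs by their fields (value1, then value2) --
# an assumption that this ordering is what you want.
(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.sort_array(F.collect_list('values')).alias('values'))
   .show(truncate=False))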
