Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've reduced the number of columns for brevity; there are about 10 other attributes):

id    name    foods    foods_eaten    color  continent
1     john    apples   2              red     Europe
1     john    oranges  3              red     Europe
2     jack    apples   1              blue    North America

I want to convert it to:

id    name    apples    oranges    color    continent
1     john    2         3          red      Europe
2     jack    1         0          blue     North America

Edit:

(1) I updated the data to show a few more of the columns.

(3) I've done

df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')

Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.

Extending what you have done so far, you can collect each non-grouping column into a list and then split that list into one column per position:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"},
...         {'id': 1, 'name': 'john', 'foods': "oranges"},
...         {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
|  foods| id|name|
+-------+---+----+
| apples|  1|john|
|oranges|  1|john|
| banana|  2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> # Collect every non-grouping column into one list per (id, name) group
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name|            foods|
+---+----+-----------------+
|  1|john|[apples, oranges]|
|  2|jack|         [banana]|
+---+----+-----------------+

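>>> # Find the length of the longest list in each collected column
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>> # Split each list into one column per position; shorter lists show null in the output below
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()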
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
|  1|john|  apples| oranges|
|  2|jack|  banana|    null|
+---+----+--------+--------+
