I need to group a large dataset via Spark. I loaded it as a two-column Pandas dataframe, and I then need to re-convert the result into Pandas: basically doing Pandas -> 'pyspark.sql.group.GroupedData' -> Pandas. Elements in both columns are integers, and the grouped data needs to be stored in list format as follows:
df.a  df.b
1     3
2     5
3     8
1     2
3     1
2     6
...
spark_df = spark.createDataFrame(df)
spark_grouped_df = spark_df.groupBy('a')
# type: <class 'pyspark.sql.group.GroupedData'>
At this point, I need something like this as a Pandas df (afterwards I need to do other operations that are more Pandas-friendly):
a | b
1 | [3,2]
2 | [5,6]
3 | [8,1]
...
If using Pandas, I would do this, but it is too time-consuming:
grouped_data = pd.DataFrame(df.groupby('a',as_index = True, sort = True)['b'].apply(list))
With Spark, I'm sure it would be way faster.
Any hints? Thanks!
You need to aggregate over the grouped data. To get your output format, you can use the collect_list function:
>>> from pyspark.sql.functions import collect_list
>>> pdf = spark_df.groupby('a').agg(collect_list('b').alias('b')).toPandas()
>>> pdf.head()
a b
0 1 [3, 2]
1 3 [8, 1]
2 2 [5, 6]
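Note that, as the output above shows, Spark makes no guarantee about row order after a shuffle, so the keys may come back unsorted. If you need the sorted-by-key order that `groupby(sort=True)` gives in Pandas, you can either add `.orderBy('a')` before `.toPandas()` on the Spark side, or sort afterwards in Pandas. A minimal sketch of the Pandas-side fix, using a hand-built frame standing in for the `toPandas()` result:

```python
import pandas as pd

# Stand-in for the frame returned by toPandas(); Spark may emit rows in any order
pdf = pd.DataFrame({'a': [1, 3, 2], 'b': [[3, 2], [8, 1], [5, 6]]})

# Restore the sorted-by-key order that pandas' groupby(sort=True) would produce
pdf = pdf.sort_values('a').reset_index(drop=True)

print(pdf)
```

Sorting a three-row (or even million-row) result on the Pandas side is cheap; the expensive part, the grouping itself, still happens in Spark.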