
How can I use the equivalent of pandas pivot_table() in pyspark?

I have a PySpark DataFrame that I want to pivot.

input_dataframe:

mdn    top_protocol_by_vol  top_vol  rank
55555  AAA                  30       1
55555  BBB                  20       2
55555  DDD                  10       3
9898   JJJ                  30       1
9898   CCC                  20       2
9898   FFF                  10       3
2030   PPP                  30       1
2030   KKK                  20       2
2030   FFF                  10       3
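
For reference, a minimal sketch to reproduce this input in PySpark (the SparkSession name spark and the column types are assumptions, not from the original post):

# Build the sample input; mdn is treated as a string here, which is an
# assumption about the original schema.
input_dataframe = spark.createDataFrame(
    [('55555', 'AAA', 30, 1), ('55555', 'BBB', 20, 2), ('55555', 'DDD', 10, 3),
     ('9898',  'JJJ', 30, 1), ('9898',  'CCC', 20, 2), ('9898',  'FFF', 10, 3),
     ('2030',  'PPP', 30, 1), ('2030',  'KKK', 20, 2), ('2030',  'FFF', 10, 3)],
    ['mdn', 'top_protocol_by_vol', 'top_vol', 'rank'],
)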

and I want to have something like this

output_dataframe:

mdn    top_protocol_by_vol_1  top_protocol_by_vol_2  top_protocol_by_vol_3  top_vol_1  top_vol_2  top_vol_3
2030   PPP                    KKK                    FFF                    30         20         10
9898   JJJ                    CCC                    FFF                    30         20         10
55555  AAA                    BBB                    DDD                    30         20         10

I know for sure that I can do something like this with Pandas using the code:

output_dataframe = input_dataframe.pivot_table(
    index='mdn',
    columns=['rank'],
    aggfunc=lambda x: ''.join(x) if isinstance(x, str) else x,
    dropna=True,
).reset_index()

# Flatten the resulting MultiIndex columns into single strings.
output_dataframe.columns = [
    '_'.join(str(c) for c in col if c != '')
    for col in output_dataframe.columns.values
]

How can I achieve the same results with PySpark, without converting to Pandas?

You can use the pivot function with first as the aggregation.

from pyspark.sql import functions as F

# Group by mdn, pivot on rank, and take the first (and only) value of
# each column within every (mdn, rank) cell.
df = (df.groupby('mdn')
      .pivot('rank')
      .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
           F.first('top_vol').alias('top_vol')))
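
With two aggregations, Spark names the pivoted columns <pivot value>_<alias>, e.g. 1_top_protocol_by_vol. A short follow-up sketch, assuming those default names, that renames them into the <alias>_<rank> form shown in output_dataframe:

# Rename e.g. '1_top_protocol_by_vol' to 'top_protocol_by_vol_1' so the
# columns match the desired output.
for col_name in df.columns:
    if col_name != 'mdn':
        rank, _, base = col_name.partition('_')
        df = df.withColumnRenamed(col_name, f'{base}_{rank}')

If the rank values are known in advance, passing them explicitly as .pivot('rank', [1, 2, 3]) also spares Spark the extra job it otherwise runs to compute the distinct pivot values.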
