I want to perform an operation in PySpark similar to what is possible with pandas.
My dataframe is:
Column            Row 0
Year              2021
win_loss_date     2021-03-08 00:00:00
Deal              1-2JZONGU
L2 GFCID Name     TEST GFCID CREATION
L2 GFCID          P-1-P1DO
GFCID             P-1-P5O
GFCID Name        TEST GFCID CREATION
Client Priority   None
Location          UNITED STATES
Deal Location     UNITED STATES
Revenue           4567.0000000
Deal Conclusion   Won
New/Rebid         New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()
The columns being pivoted are columns=['New/Rebid', 'Year', 'Deal Conclusion'].
The output I get, which is also the output I expect:
GFCID GFCID Name Client Priority Deal Revenue
New/Rebid New Rebid New Rebid
Year 2020 2021 2020 2021 2020 2021 2020 2021
Deal Conclusion Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won
0 0000000752 ARAMARK SERVICES INC Bronze NaN 1.0 1.0 2.0 NaN NaN NaN NaN NaN 1600000.0000000 20.0000000 20000.0000000 NaN NaN NaN NaN
What I want is to convert the above code to PySpark. What I am trying is not working:
from pyspark.sql import functions as F
df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))
I tried that because this operation is not possible in PySpark:
(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # error: pivot() accepts only one pivot column
You can concatenate the multiple columns into a single column, which can then be used within pivot. Consider the following example:
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
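from pyspark.sql import functions as func

# Combine the two columns to pivot on into one column, then pivot on it.
data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()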
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had 2 fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and the resulting column was used within pivot. After that, you can use multiple aggregations within agg, as your requirements dictate.