
How to pivot multiple columns in PySpark, similar to pandas

I want to perform an operation in PySpark similar to what is possible with pandas.

My dataframe is (one sample row, shown field by field):

Year            : 2021
win_loss_date   : 2021-03-08 00:00:00
Deal            : 1-2JZONGU
L2 GFCID Name   : TEST GFCID CREATION
L2 GFCID        : P-1-P1DO
GFCID           : P-1-P5O
GFCID Name      : TEST GFCID CREATION
Client Priority : None
Location        : UNITED STATES
Deal Location   : UNITED STATES
Revenue         : 4567.0000000
Deal Conclusion : Won
New/Rebid       : New

In pandas, the code to pivot is:

df = pd.pivot_table(deal_df_pandas, 
                      index=['GFCID', 'GFCID Name', 'Client Priority'], 
                      columns=['New/Rebid', 'Year', 'Deal Conclusion'], 
                      aggfunc={'Deal':'count',
                               'Revenue':'sum',
                               'Location': lambda x: set(x),
                               'Deal Location': lambda x: set(x)}).reset_index()

Here, columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the columns being pivoted.

The output I get (and expect):

        GFCID            GFCID Name  Client Priority  Deal                                     Revenue
New/Rebid                                             New                 Rebid                New                          Rebid
Year                                                  2020      2021      2020      2021      2020           2021          2020      2021
Deal Conclusion                                       Lost Won  Lost Won  Lost Won  Lost Won  Lost Won        Lost Won     Lost Won  Lost Won
0  0000000752  ARAMARK SERVICES INC  Bronze          NaN  1.0  1.0  2.0  NaN  NaN  NaN  NaN  NaN  1600000.0  20.0 20000.0 NaN  NaN  NaN  NaN

What I want is to convert the above code to PySpark. What I am trying is not working:

from pyspark.sql import functions as F

df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'),
         F.count('Deal'), F.sum('Revenue')))

This is because the following operation is not possible in PySpark (pivot() accepts only a single column):

(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # error: pivot() takes a single pivot column

You can concatenate the multiple columns into a single column, which can then be used within pivot().

Consider the following example:

data_sdf.show()

# +---+-----+--------+--------+
# | id|state|    time|expected|
# +---+-----+--------+--------+
# |  1|    A|20220722|       1|
# |  1|    A|20220723|       1|
# |  1|    B|20220724|       2|
# |  2|    B|20220722|       1|
# |  2|    C|20220723|       2|
# |  2|    B|20220724|       3|
# +---+-----+--------+--------+

from pyspark.sql import functions as func

# concatenate the two pivot fields into a single delimited column
data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()

# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# |  1|         1|         1|         0|         2|         0|
# |  2|         0|         0|         1|         3|         2|
# +---+----------+----------+----------+----------+----------+

The input dataframe had two fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot(). After that, you can use multiple aggregations within agg(), per your requirements.
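For instance, here is a minimal sketch of the same idea applied to the question's dataframe. It assumes df_d1 contains the columns shown in the question; collect_set stands in for the pandas lambda x: set(x) aggregation, and the alias names (deal_count, revenue_sum, locations, deal_locations) are made up for illustration:

from pyspark.sql import functions as F

df_pivot = (df_d1
    # combine the three pivot fields into one delimited column
    .withColumn('pivot_col',
                F.concat_ws('_', F.col('New/Rebid'),
                            F.col('Year').cast('string'),
                            F.col('Deal Conclusion')))
    .groupBy('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('pivot_col')
    .agg(F.count('Deal').alias('deal_count'),           # pandas: 'Deal': 'count'
         F.sum('Revenue').alias('revenue_sum'),         # pandas: 'Revenue': 'sum'
         F.collect_set('Location').alias('locations'),  # pandas: lambda x: set(x)
         F.collect_set('Deal Location').alias('deal_locations')))

Each resulting column is named like New_2021_Won_deal_count, which is the flattened analogue of the pandas MultiIndex columns.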
