
How to Pivot multiple columns in pyspark similar to pandas

I want to perform a similar operation in PySpark to what is possible with pandas.

My dataframe is:

One sample row (shown transposed for readability):

Year             2021
win_loss_date    2021-03-08 00:00:00
Deal             1-2JZONGU
L2 GFCID Name    TEST GFCID CREATION
L2 GFCID         P-1-P1DO
GFCID            P-1-P5O
GFCID Name       TEST GFCID CREATION
Client Priority  None
Location         UNITED STATES
Deal Location    UNITED STATES
Revenue          4567.0000000
Deal Conclusion  Won
New/Rebid        New

In pandas, the code to pivot is:

import pandas as pd

df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()

columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the pivoted columns.

The output I get (and expect):

                GFCID            GFCID Name Client Priority Deal                                       Revenue
New/Rebid                                                   New                 Rebid                  New                                            Rebid
Year                                                        2020      2021      2020      2021        2020                  2021                     2020      2021
Deal Conclusion                                             Lost Won  Lost Won  Lost Won  Lost Won    Lost Won              Lost       Won           Lost Won  Lost Won
0          0000000752 ARAMARK SERVICES INC          Bronze  NaN  1.0  1.0  2.0  NaN  NaN  NaN  NaN    NaN  1600000.0000000 20.0000000 20000.0000000 NaN  NaN  NaN  NaN

What I want is to convert the above code to PySpark. What I am trying is not working:

from pyspark.sql import functions as F

df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'),
         F.count('Deal'), F.sum('Revenue')))

since this operation is not possible in PySpark (pivot() accepts only one pivot column, plus an optional list of values):

(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # error: too many arguments
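
A minimal sketch that reproduces the constraint (toy data; a running SparkSession is assumed, and only the call shapes matter here):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy frame mirroring the pivot fields above
df = spark.createDataFrame(
    [('New', 2021, 'Won'), ('Rebid', 2020, 'Lost')],
    ['New/Rebid', 'Year', 'Deal Conclusion'])

# works: one pivot column, optionally with its list of distinct values
df.groupBy('Year').pivot('New/Rebid', ['New', 'Rebid']).agg(F.count('*')).show()

# fails: pivot() does not take additional pivot columns
# df.groupBy('Year').pivot('New/Rebid', 'Year', 'Deal Conclusion')  # TypeError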

You can concatenate the multiple columns into a single column, which can then be used within pivot.

Consider the following example:

data_sdf.show()

# +---+-----+--------+--------+
# | id|state|    time|expected|
# +---+-----+--------+--------+
# |  1|    A|20220722|       1|
# |  1|    A|20220723|       1|
# |  1|    B|20220724|       2|
# |  2|    B|20220722|       1|
# |  2|    C|20220723|       2|
# |  2|    B|20220724|       3|
# +---+-----+--------+--------+

from pyspark.sql import functions as func

data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()

# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# |  1|         1|         1|         0|         2|         0|
# |  2|         0|         0|         1|         3|         2|
# +---+----------+----------+----------+----------+----------+

The input dataframe had 2 fields, state and time, that were to be pivoted. They were concatenated with a '_' delimiter and used within pivot. After that you can use multiple aggregations within agg, per your requirements, as sketched below.
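Applied back to the question's dataframe, a sketch of the same approach could look like this (it assumes df_d1 holds the columns shown in the question; collect_set stands in for the lambda x: set(x) aggregations from the pandas version):

from pyspark.sql import functions as F

df_pivot2 = (df_d1
    # collapse the three pivot fields into one, e.g. 'New_2021_Won'
    .withColumn('pivot_col',
                F.concat_ws('_', 'New/Rebid', 'Year', 'Deal Conclusion'))
    .groupBy('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('pivot_col')
    .agg(F.count('Deal').alias('Deal'),
         F.sum('Revenue').alias('Revenue'),
         F.collect_set('Location').alias('Location'),
         F.collect_set('Deal Location').alias('Deal_Location')))

With multiple aggregations, Spark names each resulting column as the pivot value plus the aggregation alias, e.g. New_2021_Won_Deal and New_2021_Won_Revenue.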
