Pivot a DataFrame in PySpark

Question

I have a DataFrame, test, with the following columns:

Type  Name  Country      Year    Value
1     Rec      US        2018      8
2     fg       UK        2019      2
5     vd      India      2020      1
7     se       US        2021      3

I want to pivot it. I tried the expression below:

pivotdata=spark.sql("select * from test").groupby("Country").pivot("Year").sum("Value").show()

I get output, but it shows only some of the columns; the remaining two are missing.

Country  2018  2019  2020  2021
US        -     -
UK        -      -
India     -      -
US        -      -

So if I want all the columns, what should I do?

Answer 1

If I understand your requirement correctly, you also have to pass the other columns to sum(). Consider the example below:

tst=sqlContext.createDataFrame([('2020-04-23',1,2,"india"),('2020-04-24',1,3,"india"),('2020-04-23',1,4,"china"),('2020-04-24',1,5,"china"),('2020-04-23',1,7,"germany"),('2020-04-24',1,9,"germany")],schema=('date','quantity','value','country'))
tst.show()
+----------+--------+-----+-------+
|      date|quantity|value|country|
+----------+--------+-----+-------+
|2020-04-23|       1|    2|  india|
|2020-04-24|       1|    3|  india|
|2020-04-23|       1|    4|  china|
|2020-04-24|       1|    5|  china|
|2020-04-23|       1|    7|germany|
|2020-04-24|       1|    9|germany|
+----------+--------+-----+-------+
# Do not chain .show() onto the assignment: show() returns None, so df_pivot.show() would then fail.
df_pivot=tst.groupby('country').pivot('date').sum('quantity','value')
df_pivot.show()
+-------+------------------------+---------------------+------------------------+---------------------+
|country|2020-04-23_sum(quantity)|2020-04-23_sum(value)|2020-04-24_sum(quantity)|2020-04-24_sum(value)|
+-------+------------------------+---------------------+------------------------+---------------------+
|germany|                       1|                    7|                       1|                    9|
|  china|                       1|                    4|                       1|                    5|
|  india|                       1|                    2|                       1|                    3|
+-------+------------------------+---------------------+------------------------+---------------------+
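To make the result above easier to follow, here is a minimal plain-Python sketch of what the grouped pivot computes for these six rows. It is only an illustration and not how Spark implements pivot; the helper name pivot_sum is hypothetical, and the column naming simply copies the pattern visible in the output above.

```python
from collections import defaultdict

# Same rows as the tst DataFrame above, written as plain dicts (no Spark needed).
rows = [
    {"date": "2020-04-23", "quantity": 1, "value": 2, "country": "india"},
    {"date": "2020-04-24", "quantity": 1, "value": 3, "country": "india"},
    {"date": "2020-04-23", "quantity": 1, "value": 4, "country": "china"},
    {"date": "2020-04-24", "quantity": 1, "value": 5, "country": "china"},
    {"date": "2020-04-23", "quantity": 1, "value": 7, "country": "germany"},
    {"date": "2020-04-24", "quantity": 1, "value": 9, "country": "germany"},
]


def pivot_sum(rows, group_col, pivot_col, value_cols):
    """Hypothetical helper: one summed column per (pivot value, value column) pair."""
    result = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for col in value_cols:
            # Name copies the pattern in the output above: <pivot value>_sum(<column>)
            name = f"{row[pivot_col]}_sum({col})"
            result[row[group_col]][name] += row[col]
    # Combinations that never occur are simply absent here (Spark would show null).
    return {group: dict(cols) for group, cols in result.items()}


pivoted = pivot_sum(rows, "country", "date", ["quantity", "value"])
for country, cols in pivoted.items():
    print(country, cols)
```

Each value column passed to the aggregation yields one output column per distinct date, which is why two value columns and two dates give four pivoted columns.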

If you don't like the odd-looking column names, you can use the agg function with aliases to define your own suffixes for the pivoted column names.

from pyspark.sql import functions as F  # F is used for the aggregate expressions below
tst_res=tst.groupby('country').pivot('date').agg(F.sum('quantity').alias('sum_quantity'),F.sum('value').alias('sum_value'))
tst_res.show()
+-------+-----------------------+--------------------+-----------------------+--------------------+
|country|2020-04-23_sum_quantity|2020-04-23_sum_value|2020-04-24_sum_quantity|2020-04-24_sum_value|
+-------+-----------------------+--------------------+-----------------------+--------------------+
|germany|                      1|                   7|                      1|                   9|
|  china|                      1|                   4|                      1|                   5|
|  india|                      1|                   2|                      1|                   3|
+-------+-----------------------+--------------------+-----------------------+--------------------+

Solution 1
2  Accepted  2020-06-24 06:01:19