How to pivot columns so they turn into rows using PySpark or pandas?
I have a dataframe that looks like the one below, but with hundreds of rows. I need to pivot it, so that each column after Region becomes a row, like the other table below.
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|city          |city_tier | city_classification | Region   | Jan-2022-orders  | Feb-2022-orders  | Mar-2022-orders |
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|new york      | large    | alpha               | NE       | 100000           | 195000           | 237000          |
|los angeles   | large    | alpha               | W        | 330000           | 400000           | 580000          |
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
I need to pivot it using PySpark, so I end up with something like this:
+--------------+----------+---------------------+----------+-----------+---------+
|city          |city_tier | city_classification | Region   | month     | orders  |
+--------------+----------+---------------------+----------+-----------+---------+
|new york      | large    | alpha               | NE       | Jan-2022  | 100000  |
|new york      | large    | alpha               | NE       | Feb-2022  | 195000  |
|new york      | large    | alpha               | NE       | Mar-2022  | 237000  |
|los angeles   | large    | alpha               | W        | Jan-2022  | 330000  |
|los angeles   | large    | alpha               | W        | Feb-2022  | 400000  |
|los angeles   | large    | alpha               | W        | Mar-2022  | 580000  |
+--------------+----------+---------------------+----------+-----------+---------+
PS: A solution using pandas would work too.
In pandas:
df.melt(df.columns[:4], var_name = 'month', value_name = 'orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
or even
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
var_name = 'month', value_name = 'orders')
          city city_tier city_classification Region            month  orders
0     new york     large               alpha     NE  Jan-2022-orders  100000
1  los angeles     large               alpha      W  Jan-2022-orders  330000
2     new york     large               alpha     NE  Feb-2022-orders  195000
3  los angeles     large               alpha      W  Feb-2022-orders  400000
4     new york     large               alpha     NE  Mar-2022-orders  237000
5  los angeles     large               alpha      W  Mar-2022-orders  580000
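Note that the month values above still carry the -orders suffix. A small follow-up step can strip it so the month column matches the desired output exactly (a sketch, assuming the same columns as the question's dataframe):

```python
import pandas as pd

# Rebuild the example frame from the question (a sketch of the data above).
df = pd.DataFrame({
    'city': ['new york', 'los angeles'],
    'city_tier': ['large', 'large'],
    'city_classification': ['alpha', 'alpha'],
    'Region': ['NE', 'W'],
    'Jan-2022-orders': [100000, 330000],
    'Feb-2022-orders': [195000, 400000],
    'Mar-2022-orders': [237000, 580000],
})

# Melt: the first 4 columns stay as identifiers, the rest become rows.
long_df = df.melt(df.columns[:4].tolist(), var_name='month', value_name='orders')
# Drop the trailing "-orders" so "Jan-2022-orders" becomes "Jan-2022".
long_df['month'] = long_df['month'].str.replace('-orders', '', regex=False)
```

After this, long_df has the month/orders layout shown in the question.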
In PySpark, with your current example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
['city', 'city_tier', 'city_classification', 'Region', 'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
    'city', 'city_tier', 'city_classification', 'Region',
    F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Feb-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# |       city|city_tier|city_classification|Region|   month|orders|
# +-----------+---------+-------------------+------+--------+------+
# |   new york|    large|              alpha|    NE|Jan-2022|100000|
# |   new york|    large|              alpha|    NE|Feb-2022|195000|
# |   new york|    large|              alpha|    NE|Mar-2022|237000|
# |los angeles|    large|              alpha|     W|Jan-2022|330000|
# |los angeles|    large|              alpha|     W|Feb-2022|400000|
# |los angeles|    large|              alpha|     W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
The function which enables it is stack. It does not have a dataframe API, so you need to use expr to access it.
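With hundreds of order columns, hand-writing each label/column pair in stack() gets tedious. A hypothetical helper (build_stack_expr is my name, not a Spark API) can assemble the expression string from the column names:

```python
# Hypothetical helper: build the stack() SQL expression for any number of
# "<Mon>-<Year>-orders" columns instead of hand-writing each pair.
def build_stack_expr(month_cols):
    pairs = ", ".join(
        f"'{c.removesuffix('-orders')}', `{c}`"  # label first, then backticked column
        for c in month_cols
    )
    return f"stack({len(month_cols)}, {pairs}) as (month, orders)"

expr_str = build_stack_expr(['Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders'])
# Then pass it to F.expr, e.g.:
# df2 = df.select('city', 'city_tier', 'city_classification', 'Region', F.expr(expr_str))
```

The backticks are needed because the column names contain hyphens, which Spark SQL would otherwise parse as subtraction.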
BTW, this is not pivoting, it's the opposite: unpivoting.