Renaming pivoted and aggregated columns in a PySpark DataFrame

Question

Given a DataFrame like this:

from pyspark.sql.functions import avg, first

rdd = sc.parallelize(
    [
        (0, "A", 223,"201603", "PORT"), 
        (0, "A", 22,"201602", "PORT"), 
        (0, "A", 422,"201601", "DOCK"), 
        (1,"B", 3213,"201602", "DOCK"), 
        (1,"B", 3213,"201601", "PORT"), 
        (2,"C", 2321,"201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])

df_data.show()

I pivot it:

df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show()

+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
|  2|   C|          2321.0|                DOCK|            null|                null|            null|                null|
|  0|   A|           422.0|                DOCK|            22.0|                PORT|           223.0|                PORT|
|  1|   B|          3213.0|                PORT|          3213.0|                DOCK|            null|                null|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+

But I get these really complicated column names. Applying alias to the aggregation usually works, but with the pivot the names here are even worse:

+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
|  2|   C|                                                        2321.0|                                                              DOCK|                                                          null|                                                              null|                                                          null|                                                              null|
|  0|   A|                                                         422.0|                                                              DOCK|                                                          22.0|                                                              PORT|                                                         223.0|                                                              PORT|
|  1|   B|                                                        3213.0|                                                              PORT|                                                        3213.0|                                                              DOCK|                                                          null|                                                              null|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+

Is there a way to rename the columns dynamically as part of the pivot and aggregation?

Answer 1

A simple regular expression should do the trick:

import re

def clean_names(df):
    # Raw string so that \w and \( reach the regex engine unescaped.
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])

pivoted = df_data.groupby(...).pivot(...).agg(...)

clean_names(pivoted).printSchema()
## root
##  |-- id: long (nullable = true)
##  |-- type: string (nullable = true)
##  |-- 201601_cost: double (nullable = true)
##  |-- 201601_ship: string (nullable = true)
##  |-- 201602_cost: double (nullable = true)
##  |-- 201602_ship: string (nullable = true)
##  |-- 201603_cost: double (nullable = true)
##  |-- 201603_ship: string (nullable = true)

If you want to keep the function name, change the replacement pattern to, for example, \1_\2_\3.
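Both replacement patterns can be tried on plain strings without a Spark session. The sketch below is only an illustration: clean_name and its keep_function flag are hypothetical names, and the sample column names are copied from the output in the question.

```python
import re

# Same pattern as in clean_names above, written as a raw string.
PIVOT_COL = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")

def clean_name(name, keep_function=False):
    """Rewrite one pivot column name; names that do not match are returned unchanged."""
    replacement = r"\1_\2_\3" if keep_function else r"\1_\3"
    return PIVOT_COL.sub(replacement, name)

columns = ["id", "type", "201601_avg(cost)", "201601_first(ship)()"]
short_names = [clean_name(c) for c in columns]
long_names = [clean_name(c, keep_function=True) for c in columns]
print(short_names)
print(long_names)
```

Grouping columns such as id and type do not match the pattern, so they pass through untouched.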

Answer 2

A simple approach is to put an alias after each aggregate function. I start from the df_data Spark DataFrame you created.

df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost").alias("avg_cost"), first("ship").alias("first_ship")).show()
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| id|type|201601_avg_cost|201601_first_ship|201602_avg_cost|201602_first_ship|201603_avg_cost|201603_first_ship|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
|  1|   B|         3213.0|             PORT|         3213.0|             DOCK|           null|             null|
|  2|   C|         2321.0|             DOCK|           null|             null|           null|             null|
|  0|   A|          422.0|             DOCK|           22.0|             PORT|          223.0|             PORT|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+

The column names take the form "original_column_name_aliased_column_name". In your case original_column_name (the pivot value) is 201601 and aliased_column_name is avg_cost, so the column name is 201601_avg_cost, joined by an underscore "_".
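The naming rule described here can be written down as a small plain-Python helper to predict the resulting column names. This is a sketch, not part of Spark: pivot_column_names is a hypothetical helper, and it assumes more than one aggregation is passed to agg, as in the example above.

```python
def pivot_column_names(pivot_values, aliases):
    """Predict <pivot value>_<alias> names, grouped by pivot value."""
    return [f"{value}_{alias}" for value in pivot_values for alias in aliases]

names = pivot_column_names(
    ["201601", "201602", "201603"],   # distinct values of the pivot column
    ["avg_cost", "first_ship"],       # aliases given to the aggregations
)
print(names)
```

The predicted names can be compared against df.columns (after the grouping columns) as a quick sanity check.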

Answer 3

You can alias the aggregations directly:

pivoted = df_data \
    .groupby(df_data.id, df_data.type) \
    .pivot("date") \
    .agg(
       avg('cost').alias('cost'),
       first("ship").alias('ship')
    )

pivoted.printSchema()
##root
##|-- id: long (nullable = true)
##|-- type: string (nullable = true)
##|-- 201601_cost: double (nullable = true)
##|-- 201601_ship: string (nullable = true)
##|-- 201602_cost: double (nullable = true)
##|-- 201602_ship: string (nullable = true)
##|-- 201603_cost: double (nullable = true)
##|-- 201603_ship: string (nullable = true)

Answer 4

I wrote a quick and simple function to do this. Enjoy! :)

# Rename the default column names that a Spark pivot produces.
def rename_pivot_cols(rename_df, remove_agg):
    """Change a Spark pivot table's default column names.
        Option 1: remove_agg = True: `2_sum(sum_amt)` --> `sum_amt_2`.
        Option 2: remove_agg = False: `2_sum(sum_amt)` --> `sum_sum_amt_2`
    """
    for column in rename_df.columns:
        sep_index = column.find('_')
        start_index = column.find('(')
        end_index = column.find(')')
        # Skip columns that are not `<pivot value>_<agg>(<column>)`, e.g. grouping columns.
        if not (0 < sep_index < start_index < end_index):
            continue
        # Everything before the first underscore, so multi-character pivot values survive.
        prefix = column[:sep_index]
        if remove_agg:
            new_column = column[start_index+1:end_index] + '_' + prefix
        else:
            new_column = column[sep_index+1:end_index].replace('(', '_') + '_' + prefix
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
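The two options from the docstring can be checked on plain strings before touching a DataFrame. The variant below is a sketch: pivot_new_name is a hypothetical helper, and it assumes the pivot value is everything before the first underscore, so a multi-character value such as 201601 is kept whole.

```python
def pivot_new_name(column, remove_agg):
    """Return the new name for one pivot column.

    Names that do not look like <pivot value>_<agg>(<column>) are returned unchanged.
    """
    sep = column.find("_")
    start = column.find("(")
    end = column.find(")")
    if not (0 < sep < start < end):
        return column
    prefix = column[:sep]                 # pivot value
    inner = column[start + 1:end]         # aggregated column
    if remove_agg:
        return inner + "_" + prefix
    agg = column[sep + 1:start]           # aggregate function name
    return agg + "_" + inner + "_" + prefix

cols = ["id", "2_sum(sum_amt)", "201601_first(ship)()"]
renamed_short = [pivot_new_name(c, True) for c in cols]
renamed_long = [pivot_new_name(c, False) for c in cols]
print(renamed_short)
print(renamed_long)
```

The resulting list can be passed to df.toDF(*names) to rename every column in one call instead of chaining withColumnRenamed.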

Answer 5

A modified version of zero323's answer, for Spark 2.4:

import re

def clean_names(df):
    # Raw string so that \w, \s and \( reach the regex engine unescaped.
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)(,\s\w+)\)(:\s\w+)?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])

Column names currently look like 0_first(is_flashsale, false): int
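As a quick check, the pattern can be applied to the sample name above with nothing but the re module. This is only a sketch on plain strings; the second name is an assumed example of the older format, included to show what the pattern does not cover.

```python
import re

# Same pattern as in the Spark 2.4 version of clean_names, as a raw string.
SPARK24_COL = re.compile(r"^(\w+?)_([a-z]+)\((\w+)(,\s\w+)\)(:\s\w+)?")

cleaned = SPARK24_COL.sub(r"\1_\3", "0_first(is_flashsale, false): int")

# The ", false" part is mandatory in this pattern, so a name without a
# second argument is left unchanged.
untouched = SPARK24_COL.sub(r"\1_\3", "0_avg(cost)")

print(cleaned)
print(untouched)
```

If a pivot mixes both name styles, one option is to run both versions of the pattern over the column list.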

5 answers

Answer 1: score 8, accepted, 2016-06-15 18:21:38
Answer 2: score 6, 2017-10-25 17:55:52
Answer 3: score 2, 2017-06-08 15:21:25
Answer 4: score 0, 2019-02-28 19:31:38
Answer 5: score 0, 2020-06-01 08:32:10
