Rename pivoted and aggregated column in PySpark Dataframe
With a dataframe as follows:
from pyspark.sql.functions import avg, first
rdd = sc.parallelize(
[
(0, "A", 223,"201603", "PORT"),
(0, "A", 22,"201602", "PORT"),
(0, "A", 422,"201601", "DOCK"),
(1,"B", 3213,"201602", "DOCK"),
(1,"B", 3213,"201601", "PORT"),
(2,"C", 2321,"201601", "DOCK")
]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])
df_data.show()
I pivot it:
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show()
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
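But I get these really complicated column names. Applying alias on the aggregation usually works, but because of the pivot in this case the names are even worse: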
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
Is there a way to rename the columns dynamically as part of the pivot and aggregation?
A simple regular expression should do the trick:
import re
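def clean_names(df):
    # Matches <pivot value>_<function>(<column>) with an optional trailing "()"
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    # Keep the pivot value (group 1) and the column name (group 3)
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])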
pivoted = df_data.groupby(...).pivot(...).agg(...)
clean_names(pivoted).printSchema()
## root
## |-- id: long (nullable = true)
## |-- type: string (nullable = true)
## |-- 201601_cost: double (nullable = true)
## |-- 201601_ship: string (nullable = true)
## |-- 201602_cost: double (nullable = true)
## |-- 201602_ship: string (nullable = true)
## |-- 201603_cost: double (nullable = true)
## |-- 201603_ship: string (nullable = true)
If you want to preserve the function name, change the substitution pattern to, for example, r"\1_\2_\3".
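To see what the pattern does without a Spark session, the same substitution can be run on plain strings. This is only a minimal sketch: the column names are copied from the pivot output in the question, and the helper name clean_name and its keep_function flag are made up for illustration.

```python
import re

# Same pattern as in clean_names: <pivot value>_<function>(<column>),
# optionally followed by a trailing "()".
PATTERN = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")


def clean_name(column, keep_function=False):
    """Hypothetical helper: rename one pivoted column name given as a string."""
    replacement = r"\1_\2_\3" if keep_function else r"\1_\3"
    return PATTERN.sub(replacement, column)


columns = ["id", "type", "201601_avg(cost)", "201601_first(ship)()"]

# Names that do not match the pattern (id, type) pass through unchanged.
print([clean_name(c) for c in columns])
print([clean_name(c, keep_function=True) for c in columns])
```

The first call prints ['id', 'type', '201601_cost', '201601_ship'] and the second prints ['id', 'type', '201601_avg_cost', '201601_first_ship'].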
A simple way is to use an alias after the aggregate function. I start with the df_data Spark DataFrame you created.
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost").alias("avg_cost"), first("ship").alias("first_ship")).show()
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| id|type|201601_avg_cost|201601_first_ship|201602_avg_cost|201602_first_ship|201603_avg_cost|201603_first_ship|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
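The column names will take the form "original_column_name_aliased_column_name". In your case, original_column_name is the pivot value 201601, aliased_column_name is avg_cost, and the resulting column name is 201601_avg_cost (the two parts joined by an underscore "_").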
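The naming rule described here can be sketched in plain Python to predict the column names before running the pivot. This is an illustration only: expected_pivot_columns is a hypothetical helper, not part of PySpark, and it assumes more than one aliased aggregation, as in this example.

```python
def expected_pivot_columns(group_cols, pivot_values, aliases):
    """Hypothetical helper: predict names of the form <pivot value>_<alias>."""
    # One column per (pivot value, alias) pair, grouped by pivot value.
    pivoted = [f"{value}_{alias}" for value in pivot_values for alias in aliases]
    return list(group_cols) + pivoted


names = expected_pivot_columns(
    ["id", "type"],
    ["201601", "202602"[:0] + "201602", "201603"],
    ["avg_cost", "first_ship"],
)
print(names)
```

The resulting list starts with id and type, followed by 201601_avg_cost, 201601_first_ship, and so on, in the same order as the header of the table shown in this answer.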
You can alias the aggregations directly:
pivoted = df_data \
    .groupby(df_data.id, df_data.type) \
    .pivot("date") \
    .agg(
        avg('cost').alias('cost'),
        first("ship").alias('ship')
    )
pivoted.printSchema()
##root
##|-- id: long (nullable = true)
##|-- type: string (nullable = true)
##|-- 201601_cost: double (nullable = true)
##|-- 201601_ship: string (nullable = true)
##|-- 201602_cost: double (nullable = true)
##|-- 201602_ship: string (nullable = true)
##|-- 201603_cost: double (nullable = true)
##|-- 201603_ship: string (nullable = true)
I wrote a simple, quick function to do this. Enjoy! :)
# This function renames a pivot table's ugly column names
def rename_pivot_cols(rename_df, remove_agg):
    """Change a Spark pivot table's default ugly column names with ease.
    Option 1: remove_agg = True: `2_sum(sum_amt)` --> `sum_amt_2`.
    Option 2: remove_agg = False: `2_sum(sum_amt)` --> `sum_sum_amt_2`
    """
    for column in rename_df.columns:
        start_index = column.find('(')
        end_index = column.find(')')
        prefix = column[:start_index] if start_index > 0 else ''
        # Leave columns such as the grouping keys untouched
        if '_' not in prefix or end_index < start_index:
            continue
        # Split on the last underscore so multi-character pivot values work
        # (assumes the aggregate function name itself contains no underscore)
        pivot_value, _, agg_name = prefix.rpartition('_')
        new_column = column[start_index + 1:end_index] + '_' + pivot_value
        if not remove_agg:
            new_column = agg_name + '_' + new_column
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
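The two options in the docstring can be checked on plain strings, without a Spark session. The following is a rough sketch rather than the answer's own code: rename_pivot_name is a hypothetical helper, and it assumes the aggregate function name contains no underscore, so that the last underscore before the parenthesis separates the pivot value from the function.

```python
def rename_pivot_name(column, remove_agg):
    """Hypothetical string-only helper following the docstring's two options."""
    start = column.find("(")
    end = column.find(")")
    prefix = column[:start] if start > 0 else ""
    # Names that are not of the form <pivot>_<agg>(<col>) are returned as-is.
    if "_" not in prefix or end < start:
        return column
    # Assumption: the function name has no underscore, so split on the last one.
    pivot_value, _, agg_name = prefix.rpartition("_")
    inner = column[start + 1:end]
    if remove_agg:
        return f"{inner}_{pivot_value}"
    return f"{agg_name}_{inner}_{pivot_value}"


print(rename_pivot_name("2_sum(sum_amt)", True))
print(rename_pivot_name("2_sum(sum_amt)", False))
print(rename_pivot_name("201601_avg(cost)", True))
print(rename_pivot_name("id", False))
```

This prints sum_amt_2, sum_sum_amt_2, cost_201601 and id, so pivot values longer than one character, such as the dates in the question, keep all their characters.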