PySpark: Expand One Row into Multiple Rows by Column Header
Suppose I have a DataFrame with the following columns:
# id | name | 01-Jan-10 | 01-Feb-10 | ... | 01-Jan-11 | 01-Feb-11
# -----------------------------------------------------------------
# 1 | a001 | 0 | 32 | ... | 14 | 108
# 1 | a002 | 80 | 0 | ... | 0 | 92
I want to expand it into a table like this:
# id | name | Jan | Feb | ... | Year
# -----------------------------------
# 1 | a001 | 0 | 32 | ... | 2010
# 1 | a001 | 14 | 108 | ... | 2011
# 1 | a002 | 80 | 0 | ... | 2010
# 1 | a002 | 0 | 92 | ... | 2011
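I want to split the date columns into one row per year, with each month's value captured in its own column.
How can this be done in PySpark (Python + Spark)? I have been trying to collect the DataFrame's data, iterate over it, and extract each field to write out each row, but I wonder whether there is a smarter Spark function for this. (I am new to Spark.)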
First, melt the DataFrame (see "How to melt Spark DataFrame?" for a definition of the melt helper used here):
df = spark.createDataFrame(
    [(1, "a001", 0, 32, 14, 108), (2, "a02", 80, 0, 0, 92)],
    ("id", "name", "01-Jan-10", "01-Feb-10", "01-Jan-11", "01-Feb-11")
)
# Keep the first two columns as identifiers and unpivot the date columns
df_long = melt(df, df.columns[:2], df.columns[2:])
# +---+----+---------+-----+
# | id|name| variable|value|
# +---+----+---------+-----+
# |  1|a001|01-Jan-10|    0|
# |  1|a001|01-Feb-10|   32|
# |  1|a001|01-Jan-11|   14|
# |  1|a001|01-Feb-11|  108|
# |  2| a02|01-Jan-10|   80|
# |  2| a02|01-Feb-10|    0|
# |  2| a02|01-Jan-11|    0|
# |  2| a02|01-Feb-11|   92|
# +---+----+---------+-----+
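The `melt` helper called above comes from the linked answer and is not reproduced here. As a rough stand-in, the sketch below builds the same long format with Spark SQL's `stack` generator. The names `build_stack_expr` and `melt_with_stack` are hypothetical, and the sketch assumes that all value columns share one data type and that no column name contains a backtick or a single quote. On Spark 3.4 and later, the built-in `DataFrame.unpivot` (alias `melt`) should make a custom helper unnecessary.

```python
def build_stack_expr(value_vars, var_name="variable", value_name="value"):
    """Return a Spark SQL `stack` expression that unpivots value_vars."""
    # Each column contributes a (label, value) pair: 'col', `col`
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_vars)
    return "stack({n}, {pairs}) as ({var}, {val})".format(
        n=len(value_vars), pairs=pairs, var=var_name, val=value_name
    )


def melt_with_stack(df, id_vars, value_vars):
    """Hypothetical stand-in for melt(df, id_vars, value_vars)."""
    # Backticks keep names such as 01-Jan-10 from being parsed as arithmetic
    id_cols = ["`{}`".format(c) for c in id_vars]
    return df.selectExpr(*id_cols, build_stack_expr(value_vars))
```

With the DataFrame above, `melt_with_stack(df, df.columns[:2], df.columns[2:])` would take the place of the `melt(...)` call.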
Next, parse the dates and extract the year and month:
from pyspark.sql.functions import to_date, date_format, year

# Parse the former column headers (e.g. "01-Jan-10") into dates
date = to_date("variable", "dd-MMM-yy")
parsed = df_long.select(
    "id", "name", "value",
    year(date).alias("year"), date_format(date, "MMM").alias("month")
)
# +---+----+-----+----+-----+
# | id|name|value|year|month|
# +---+----+-----+----+-----+
# |  1|a001|    0|2010|  Jan|
# |  1|a001|   32|2010|  Feb|
# |  1|a001|   14|2011|  Jan|
# |  1|a001|  108|2011|  Feb|
# |  2| a02|   80|2010|  Jan|
# |  2| a02|    0|2010|  Feb|
# |  2| a02|    0|2011|  Jan|
# |  2| a02|   92|2011|  Feb|
# +---+----+-----+----+-----+
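To see what the parsing step computes for a single column header, the same logic can be sketched in plain Python with `datetime.strptime`. This is only an illustration and not part of the Spark job; `split_header` is a hypothetical name, and `%b` assumes an English locale for month abbreviations.

```python
from datetime import datetime


def split_header(header):
    """Split a header such as '01-Jan-10' into (year, month abbreviation)."""
    # %d = day of month, %b = abbreviated month name, %y = two-digit year
    parsed = datetime.strptime(header, "%d-%b-%y")
    return parsed.year, parsed.strftime("%b")


headers = ["01-Jan-10", "01-Feb-10", "01-Jan-11", "01-Feb-11"]
print([split_header(h) for h in headers])
```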
Finally, pivot (see "How to pivot Spark DataFrame?"):
# Providing a list of levels is not required but will make the process faster
# months = [
# "Jan", "Feb", "Mar", "Apr", "May", "Jun",
# "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
# ]
months = ["Jan", "Feb"]
parsed.groupBy("id", "name", "year").pivot("month", months).sum("value")
# +---+----+----+---+---+
# | id|name|year|Jan|Feb|
# +---+----+----+---+---+
# |  2| a02|2011|  0| 92|
# |  1|a001|2010|  0| 32|
# |  1|a001|2011| 14|108|
# |  2| a02|2010| 80|  0|
# +---+----+----+---+---+
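As a sanity check on small inputs, the group-by, pivot, and sum step can be mirrored in plain Python. The sketch below is an illustration only, with the hypothetical name `pivot_months`; it takes rows shaped like the `parsed` DataFrame above and fills months that have no data with `None`.

```python
from collections import defaultdict


def pivot_months(rows, months):
    """Group (id, name, value, year, month) rows by (id, name, year)
    and sum the values per month."""
    sums = defaultdict(dict)
    for id_, name, value, year, month in rows:
        key = (id_, name, year)
        sums[key][month] = sums[key].get(month, 0) + value
    # One output row per group; month cells follow the order of `months`
    return [key + tuple(cells.get(m) for m in months)
            for key, cells in sorted(sums.items())]


rows = [
    (1, "a001", 0, 2010, "Jan"), (1, "a001", 32, 2010, "Feb"),
    (1, "a001", 14, 2011, "Jan"), (1, "a001", 108, 2011, "Feb"),
    (2, "a02", 80, 2010, "Jan"), (2, "a02", 0, 2010, "Feb"),
    (2, "a02", 0, 2011, "Jan"), (2, "a02", 92, 2011, "Feb"),
]
for row in pivot_months(rows, ["Jan", "Feb"]):
    print(row)
```

Unlike Spark, this reference version returns its rows sorted by group key, which makes it easier to compare against a collected result.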