PySpark: How to group by a fixed date range and another column calculating a value column's sum using window functions?
I have a Spark DataFrame consisting of three columns: Date, Item and Value, of types Date, String and Double respectively. I would like to group by date ranges (each range spanning 7 days, starting from the first date in the DataFrame) and by item, and calculate the sum of the value column for each group defined by the date range (in effect, the week number) and the item.
I suspect PySpark's Window functions should be used somewhere here for the date ranges, but I can't figure out how to apply them in this case.
Let's first define the approach:
(a) add a week_start_date column to each row (for each date)
(b) group by the week_start_date column (along with 'item') and calculate the sum of 'value'
Generate some test data:
from pyspark.sql.types import *

schema = StructType([StructField('date', StringType(), True),
                     StructField('item', StringType(), True),
                     StructField('value', DoubleType(), True)])

data = [('2019-01-01', 'I1', 1.1),
        ('2019-01-02', 'I1', 1.1),
        ('2019-01-10', 'I1', 1.1),
        ('2019-01-10', 'I2', 1.1),
        ('2019-01-11', 'I2', 1.1),
        ('2019-01-11', 'I3', 1.1)]

df = spark.createDataFrame(data, schema)
A Python function to generate week_start_date:
from datetime import datetime, timedelta

def week_start_date(day):
    # return the Monday of the calendar week that 'day' falls in
    dt = datetime.strptime(day, '%Y-%m-%d')
    start = dt - timedelta(days=dt.weekday())
    return start.strftime('%Y-%m-%d')

# register the function so it can be used in selectExpr / Spark SQL
spark.udf.register('week_start_date', week_start_date)
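Since the UDF is registered with Spark SQL, it can be sanity-checked directly from SQL (this quick check is an addition, not part of the original answer):

# 2019-01-10 is a Thursday, so its Monday-based week starts on 2019-01-07
spark.sql("SELECT week_start_date('2019-01-10') AS ws").show()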
Generate week_start_date using the function, then group by week_start_date and item:
from pyspark.sql import functions as F

df.selectExpr("week_start_date(date) as start_date", "date", "item", "value") \
  .groupBy("start_date", "item") \
  .agg(F.sum('value').alias('value_sum')) \
  .orderBy("start_date") \
  .show()
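Note that the above buckets rows by calendar week (Monday start) rather than by fixed 7-day ranges anchored at the first date in the DataFrame, which is what the question literally asks for. Below is a minimal sketch of that variant, using a global window function to find the earliest date; the column names first_date, week_no and range_start are introduced here purely for illustration:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# global window, used only to compute the earliest date in the whole DataFrame
w = Window.partitionBy()

bucketed = (df
    .withColumn('date_d', F.to_date('date'))
    .withColumn('first_date', F.min('date_d').over(w))
    # 0-based index of the 7-day bucket each row falls into
    .withColumn('week_no', (F.datediff('date_d', 'first_date') / 7).cast('int'))
    # start date of that 7-day bucket
    .withColumn('range_start', F.expr('date_add(first_date, week_no * 7)')))

(bucketed.groupBy('range_start', 'item')
    .agg(F.sum('value').alias('value_sum'))
    .orderBy('range_start', 'item')
    .show())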