PySpark: How to group by a fixed date range and another column calculating a value column's sum using window functions?
I have a Spark DataFrame consisting of three columns: Date, Item and Value, of types Date, String and Double respectively. I would like to group by date range (where each range lasts 7 days, starting from the first date in the dataframe) and by Item, and calculate the sum of Value for each group defined by the date range (effectively the week number) and Item.
I suspect PySpark's Window functions should be used at some point here for the date ranges, but I can't figure out how to implement them in this case.
Let's first define the approach:
(a) Add a week_start_date column for each row (each date)
(b) Group by the week_start_date column (along with item) and calculate the sum of value
Generate some test data:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('date', StringType(), True),
    StructField('item', StringType(), True),
    StructField('value', DoubleType(), True),
])
data = [('2019-01-01','I1',1.1),
('2019-01-02','I1',1.1),
('2019-01-10','I1',1.1),
('2019-01-10','I2',1.1),
('2019-01-11','I2',1.1),
('2019-01-11','I3',1.1)]
df = spark.createDataFrame(data, schema)
Python function to generate week_start_date:
from datetime import datetime, timedelta

def week_start_date(day):
    # Monday of the week containing `day`
    dt = datetime.strptime(day, '%Y-%m-%d')
    start = dt - timedelta(days=dt.weekday())
    return start.strftime('%Y-%m-%d')

spark.udf.register('week_start_date', week_start_date)
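As a quick sanity check, the helper can be exercised in plain Python before relying on it as a UDF. The snippet below restates the same Monday-anchoring logic so it runs standalone:

```python
from datetime import datetime, timedelta

def week_start_date(day):
    # Monday of the week containing `day` (same logic as the UDF above)
    dt = datetime.strptime(day, '%Y-%m-%d')
    return (dt - timedelta(days=dt.weekday())).strftime('%Y-%m-%d')

print(week_start_date('2019-01-10'))  # a Thursday -> '2019-01-07'
print(week_start_date('2019-01-07'))  # already a Monday -> '2019-01-07'
```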
Use the function to generate week_start_date, then group by week_start_date and item:
from pyspark.sql.functions import sum as sum_

df.selectExpr("week_start_date(date) as start_date", "date", "item", "value") \
    .groupBy("start_date", "item") \
    .agg(sum_('value').alias('value_sum')) \
    .orderBy("start_date") \
    .show()
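Note that this groups by calendar week (Monday-anchored), whereas the question asked for fixed 7-day ranges counted from the first date in the dataframe. One way to get that, without a UDF, is to take the datediff from the minimum date and integer-divide by 7; in Spark the column would be roughly `F.floor(F.datediff('date', F.lit(first_date)) / 7)`. A minimal sketch of the bucket logic in plain Python (the dates below come from the example data):

```python
from datetime import date

def week_bucket(day: date, first: date) -> int:
    # 0-based index of the 7-day window containing `day`,
    # counted from the earliest date in the dataframe
    return (day - first).days // 7

first = date(2019, 1, 1)  # min(date) over the example data
print(week_bucket(date(2019, 1, 1), first))   # 0
print(week_bucket(date(2019, 1, 10), first))  # 1
```

Grouping by this bucket (plus item) then gives sums over true 7-day windows anchored at the first date, rather than calendar weeks.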