
PySpark: How to group by a fixed date range and another column, calculating a value column's sum using window functions?

I have a Spark DataFrame consisting of three columns: Date, Item and Value, of types Date, String and Double respectively. I would like to group by date range (where each range spans 7 days, starting from the first date in the DataFrame) and by Item, and calculate the sum of Value for each group defined by that date range (effectively a week number) and Item.

I suspect PySpark's Window functions should be used at some point here for the date ranges, but I can't figure out how to implement them in this case.
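For reference, one way to build such fixed 7-day buckets anchored at the first date in the DataFrame (rather than at calendar week boundaries) is to count the days elapsed since the minimum date and integer-divide by 7. A minimal sketch, assuming a DataFrame df with the columns date, item and value described above; first_date, week_no and weekly_sums are illustrative names, not part of the original question:

from pyspark.sql import functions as F

# Earliest date in the DataFrame; it anchors all 7-day buckets.
first_date = df.agg(F.min('date')).collect()[0][0]

# week_no = 0 for days 0-6 after first_date, 1 for days 7-13, and so on.
weekly_sums = (df
    .withColumn('week_no', F.floor(F.datediff(F.col('date'), F.lit(first_date)) / 7))
    .groupBy('week_no', 'item')
    .agg(F.sum('value').alias('value_sum')))

The answer below takes a slightly different route and groups by calendar week (Monday-based) instead.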

Let's first define the approach for this:

(a) Add a week_start_date column for each row (each date)

(b) Group by week_start_date (along with 'item') and calculate the sum of "value"

Generate some test data

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('date', StringType(), True),
    StructField('item', StringType(), True),
    StructField('value', DoubleType(), True),
])

data = [('2019-01-01','I1',1.1),
        ('2019-01-02','I1',1.1),
        ('2019-01-10','I1',1.1),
        ('2019-01-10','I2',1.1),
        ('2019-01-11','I2',1.1),
        ('2019-01-11','I3',1.1)]

df = spark.createDataFrame(data, schema)
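With this test data, df.show() prints something like:

+----------+----+-----+
|      date|item|value|
+----------+----+-----+
|2019-01-01|  I1|  1.1|
|2019-01-02|  I1|  1.1|
|2019-01-10|  I1|  1.1|
|2019-01-10|  I2|  1.1|
|2019-01-11|  I2|  1.1|
|2019-01-11|  I3|  1.1|
+----------+----+-----+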

Python function to generate week_start_date

from datetime import datetime, timedelta

def week_start_date(day):
    # Return the Monday of the week containing `day` (a 'YYYY-MM-DD' string)
    dt = datetime.strptime(day, '%Y-%m-%d')
    start = dt - timedelta(days=dt.weekday())
    return start.strftime('%Y-%m-%d')

# Register the function so it can be used in selectExpr / Spark SQL
spark.udf.register('week_start_date', week_start_date)
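As a side note, on Spark 2.3+ the same Monday-based week start can be computed without a Python UDF, using the built-in date_trunc function; a minimal sketch (df_with_week is an illustrative name):

from pyspark.sql import functions as F

# date_trunc('week', ...) truncates a timestamp to the Monday of its week.
df_with_week = df.withColumn(
    'start_date',
    F.to_date(F.date_trunc('week', F.col('date').cast('timestamp')))
)

Skipping the UDF keeps the computation inside the JVM and avoids Python serialization overhead.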

Use the function to generate week_start_date, then group by week_start_date and item

from pyspark.sql import functions as F

df.selectExpr("week_start_date(date) as start_date", "date", "item", "value") \
    .groupBy("start_date", "item") \
    .agg(F.sum("value").alias("value_sum")) \
    .orderBy("start_date") \
    .show()
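With the test data above, the output should look roughly like this (the ordering of items within a week may vary):

+----------+----+---------+
|start_date|item|value_sum|
+----------+----+---------+
|2018-12-31|  I1|      2.2|
|2019-01-07|  I1|      1.1|
|2019-01-07|  I2|      2.2|
|2019-01-07|  I3|      1.1|
+----------+----+---------+

2019-01-01 and 2019-01-02 fall in the week starting Monday 2018-12-31, while the remaining dates fall in the week starting 2019-01-07.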
