I have a Spark DataFrame with three columns: Date, Item and Value, of types Date, String and Double respectively. I would like to group by 7-day date ranges (each range starting from the first date in the DataFrame) and by Item, and compute the sum of Value for each such group, i.e. for each (week number, Item) pair.
I suspect PySpark's window functions should be used at some point here for the date ranges, but I can't figure out how to apply them in this case.
Let's first define the approach:
(a) Add a week_start_date column for each row (each date)
(b) Group by the week_start_date column (along with 'item') and calculate the sum of "value"
Generate some test data
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('date', StringType(), True),
    StructField('item', StringType(), True),
    StructField('value', DoubleType(), True),
])
data = [('2019-01-01','I1',1.1),
('2019-01-02','I1',1.1),
('2019-01-10','I1',1.1),
('2019-01-10','I2',1.1),
('2019-01-11','I2',1.1),
('2019-01-11','I3',1.1)]
df = spark.createDataFrame(data, schema)
Python function to generate week_start_date
from datetime import datetime, timedelta

def week_start_date(day):
    # Return the Monday of the week containing `day` (a 'YYYY-MM-DD' string)
    dt = datetime.strptime(day, '%Y-%m-%d')
    start = dt - timedelta(days=dt.weekday())
    return start.strftime('%Y-%m-%d')

spark.udf.register('week_start_date', week_start_date)
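As a quick sanity check outside Spark, the same helper can be exercised directly in plain Python (the dates below come from the test data above):

```python
from datetime import datetime, timedelta

def week_start_date(day):
    # Monday of the week containing `day`, given as a 'YYYY-MM-DD' string
    dt = datetime.strptime(day, '%Y-%m-%d')
    start = dt - timedelta(days=dt.weekday())
    return start.strftime('%Y-%m-%d')

print(week_start_date('2019-01-10'))  # 2019-01-07 (the Monday of that week)
print(week_start_date('2019-01-01'))  # 2018-12-31
```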
Use the function to generate week_start_date, then group by start_date and item
from pyspark.sql import functions as F

df.selectExpr("week_start_date(date) as start_date", "date", "item", "value") \
    .groupBy("start_date", "item") \
    .agg(F.sum('value').alias('value_sum')) \
    .orderBy("start_date") \
    .show()