Create historical data from a dataframe in PySpark
I have a dataframe as follows:
date | some_quantity |
---|---|
... | ... |
2021-01-01 | 4 |
2021-01-02 | 1 |
2021-01-03 | 6 |
2021-01-04 | 2 |
2021-01-05 | 2 |
2021-01-06 | 8 |
2021-01-07 | 9 |
2021-01-08 | 1 |
... | ... |
I would like to create the historical data for each calendar day, and in a final step do some aggregations. The intermediate dataframe should look like this:
calendar_date | date | some_quantity |
---|---|---|
... | ... | ... |
2021-01-03 | 2021-01-01 | 4 |
2021-01-03 | 2021-01-02 | 1 |
2021-01-04 | ... | ... |
2021-01-04 | 2021-01-01 | 4 |
2021-01-04 | 2021-01-02 | 1 |
2021-01-04 | 2021-01-03 | 6 |
2021-01-05 | ... | ... |
2021-01-05 | 2021-01-01 | 4 |
2021-01-05 | 2021-01-02 | 1 |
2021-01-05 | 2021-01-03 | 6 |
2021-01-05 | 2021-01-04 | 2 |
2021-01-06 | ... | ... |
2021-01-06 | 2021-01-01 | 4 |
2021-01-06 | 2021-01-02 | 1 |
2021-01-06 | 2021-01-03 | 6 |
2021-01-06 | 2021-01-04 | 2 |
2021-01-06 | 2021-01-05 | 2 |
2021-01-06 | ... | ... |
With this dataframe, any aggregation on the calendar date is easy (e.g. how many quantities were sold prior to that day, the 7-day average, the 30-day average, etc.).
I tried to run a for loop over the calendar dates:
import pandas as pd
from pyspark.sql import functions as F

for i, date in enumerate(pd.date_range("2021-01-01", "2021-03-01")):
    # Keep only the rows strictly before the current calendar date
    df_transformed = df.where(F.col("date") < date)
    df_transformed = df_transformed.withColumn("calendar_date", F.lit(date))
    if i == 0:
        df_output = df_transformed
    else:
        df_output = df_output.union(df_transformed)
However, this is highly inefficient, and it crashes as soon as I start including more columns.
Is it possible to create a dataframe with calendar dates and do a join that recreates the dataframe I expect?
Thanks for any help.
You can simply join a calendar dataframe with your main dataframe using a "less than" join condition (each row's date must be earlier than the calendar date):
# Let's call your main dataframe `df`
import pandas as pd
from pyspark.sql import functions as F

# Extract the first and last date
_, min_date, max_date = (df
    .groupBy(F.lit(1))
    .agg(
        F.min('date').alias('min_date'),
        F.max('date').alias('max_date'),
    )
    .first()
)
# Then create a temporary dataframe to hold all calendar dates
dates = [{'calendar_date': str(d.date())} for d in pd.date_range(min_date, max_date)]
calendar_df = spark.createDataFrame(dates)
calendar_df.show(10, False)
# +-------------+
# |calendar_date|
# +-------------+
# |2021-01-01   |
# |2021-01-02   |
# |2021-01-03   |
# |2021-01-04   |
# |2021-01-05   |
# |2021-01-06   |
# |2021-01-07   |
# |2021-01-08   |
# +-------------+
# Now you can join to build your expected dataframe, note the join condition
(calendar_df
    .join(df, on=[calendar_df.calendar_date > df.date])
    .show()
)
# +-------------+----------+---+
# |calendar_date|      date|qty|
# +-------------+----------+---+
# |   2021-01-02|2021-01-01|  4|
# |   2021-01-03|2021-01-01|  4|
# |   2021-01-03|2021-01-02|  1|
# |   2021-01-04|2021-01-01|  4|
# |   2021-01-04|2021-01-02|  1|
# |   2021-01-04|2021-01-03|  6|
# |   2021-01-05|2021-01-01|  4|
# |   2021-01-05|2021-01-02|  1|
# |   2021-01-05|2021-01-03|  6|
# |   2021-01-05|2021-01-04|  2|
# |   2021-01-06|2021-01-01|  4|
# |   2021-01-06|2021-01-02|  1|
# |   2021-01-06|2021-01-03|  6|
# |   2021-01-06|2021-01-04|  2|
# |   2021-01-06|2021-01-05|  2|
# |   2021-01-07|2021-01-01|  4|
# |   2021-01-07|2021-01-02|  1|
# |   2021-01-07|2021-01-03|  6|
# |   2021-01-07|2021-01-04|  2|
# |   2021-01-07|2021-01-05|  2|
# +-------------+----------+---+
# only showing top 20 rows