PySpark: Last value of previous month by group avoiding self-join
I would like to obtain the last value an attribute takes per group over the previous month.
I can achieve this with a self-join like so:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = (
    spark.createDataFrame(
        [
            ("2022-07-29", 1, 1),
            ("2022-07-30", 1, 2),
            ("2022-07-31", 1, 3),
            ("2022-08-01", 1, 4),
            ("2022-08-02", 1, 5),
            ("2022-08-03", 1, 6),
            ("2022-09-10", 1, 8),
            ("2022-09-11", 1, 9),
            ("2022-09-12", 1, 10),
            ("2022-07-29", 2, 7),
            ("2022-07-30", 2, 6),
            ("2022-07-31", 2, 5),
            ("2022-08-01", 2, 4),
            ("2022-08-02", 2, 3),
            ("2022-08-03", 2, 2),
            ("2022-09-10", 2, 8),
            ("2022-09-11", 2, 9),
            ("2022-09-12", 2, 10),
        ],
        ["date", "id", "value"]
    )
    .withColumn("date", F.to_date(F.col("date")))
)

# Latest date first within each (id, month), so first() picks the month's last value.
w = Window.partitionBy("id", "month").orderBy(F.col("date").desc())

df = (
    df
    .withColumn("month", F.date_trunc("month", F.col("date")))
    .join(
        # Shift each month forward by one so it lines up with the following month.
        df
        .withColumn("month", F.add_months(F.date_trunc("month", F.col("date")), 1))
        .withColumn("last_value_prev_month", F.first(F.col("value")).over(w))
        .select("id", "month", "last_value_prev_month")
        .drop_duplicates(subset=["id", "month"]),
        on=["id", "month"],
        how="left"
    )
    .drop("month")
    .orderBy(["id", "date"])
)

df.show()
+---+----------+-----+---------------------+
| id|      date|value|last_value_prev_month|
+---+----------+-----+---------------------+
|  1|2022-07-29|    1|                 null|
|  1|2022-07-30|    2|                 null|
|  1|2022-07-31|    3|                 null|
|  1|2022-08-01|    4|                    3|
|  1|2022-08-02|    5|                    3|
|  1|2022-08-03|    6|                    3|
|  1|2022-09-10|    8|                    6|
|  1|2022-09-11|    9|                    6|
|  1|2022-09-12|   10|                    6|
|  2|2022-07-29|    7|                 null|
|  2|2022-07-30|    6|                 null|
|  2|2022-07-31|    5|                 null|
|  2|2022-08-01|    4|                    5|
|  2|2022-08-02|    3|                    5|
|  2|2022-08-03|    2|                    5|
|  2|2022-09-10|    8|                    2|
|  2|2022-09-11|    9|                    2|
|  2|2022-09-12|   10|                    2|
+---+----------+-----+---------------------+
This seems inefficient to me.
Can this be done with just a window, avoiding a self-join?
samkart's answer provides the main idea. Here I provide a solution with two windows instead of three.
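# Number of seconds in x days; w1 below is ordered by epoch seconds.
days = lambda x: x * 86400

# w1: all rows for the same id dated at least one day before the current row.
w1 = (
    Window
    .partitionBy("id")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(Window.unboundedPreceding, -days(1))
)

# w2: rows sharing the current row's id and month, earliest date first.
w2 = (
    Window
    .partitionBy("id", F.date_trunc("month", "date"))
    .orderBy(F.col("date"))
)

(
    df  # the original input DataFrame from the question (date, id, value)
    .withColumn("value_prev_day", F.last("value").over(w1))
    .withColumn("last_value_prev_month", F.first("value_prev_day").over(w2))
    .orderBy(["id", "date"])
    .show()
)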
+----------+---+-----+--------------+---------------------+
|      date| id|value|value_prev_day|last_value_prev_month|
+----------+---+-----+--------------+---------------------+
|2022-07-29|  1|    1|          null|                 null|
|2022-07-30|  1|    2|             1|                 null|
|2022-07-31|  1|    3|             2|                 null|
|2022-08-01|  1|    4|             3|                    3|
|2022-08-02|  1|    5|             4|                    3|
|2022-08-03|  1|    6|             5|                    3|
|2022-09-10|  1|    8|             6|                    6|
|2022-09-11|  1|    9|             8|                    6|
|2022-09-12|  1|   10|             9|                    6|
|2022-07-29|  2|    7|          null|                 null|
|2022-07-30|  2|    6|             7|                 null|
|2022-07-31|  2|    5|             6|                 null|
|2022-08-01|  2|    4|             5|                    5|
|2022-08-02|  2|    3|             4|                    5|
|2022-08-03|  2|    2|             3|                    5|
|2022-09-10|  2|    8|             2|                    2|
|2022-09-11|  2|    9|             8|                    2|
|2022-09-12|  2|   10|             9|                    2|
+----------+---+-----+--------------+---------------------+
value_prev_day holds the value column from the most recent earlier date (per id). When there are no gaps in the dates this is the previous day's value; after a gap it is the last value before the gap, as on 2022-09-10 in the output.
The second window, w2, partitions by id and the month of the date for the current row. We then order this partition by date, meaning that the earliest date in the month is the first row in the partition.
We assign last_value_prev_month as first(value_prev_day) over this partition. This has to be the last value of the previous month, since it is the value_prev_day of the first row of the month.
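To see why the two steps give the right answer, here is a minimal plain-Python trace of the same logic on the rows for id 1. This is only a sketch to illustrate the reasoning, not Spark code: the function name last_value_prev_month and the rows list are made up for the illustration, and it assumes the rows are already sorted by date.

```python
from datetime import date

# Rows for id 1 from the question, already sorted by date: (date, value).
rows = [
    (date(2022, 7, 29), 1), (date(2022, 7, 30), 2), (date(2022, 7, 31), 3),
    (date(2022, 8, 1), 4), (date(2022, 8, 2), 5), (date(2022, 8, 3), 6),
    (date(2022, 9, 10), 8), (date(2022, 9, 11), 9), (date(2022, 9, 12), 10),
]


def last_value_prev_month(rows):
    """Plain-Python analogue of the w1/w2 steps for a single id (hypothetical helper)."""
    # Step 1 (like w1): for each row, the value on the latest strictly earlier date.
    prev_vals = []
    for d, _ in rows:
        earlier = [v for d2, v in rows if d2 < d]
        prev_vals.append(earlier[-1] if earlier else None)

    # Step 2 (like w2): within each (year, month), keep step 1's result
    # for the earliest row and reuse it for every row of that month.
    first_in_month = {}
    for (d, _), pv in zip(rows, prev_vals):
        first_in_month.setdefault((d.year, d.month), pv)
    return [first_in_month[(d.year, d.month)] for d, _ in rows]


result = last_value_prev_month(rows)
print(result)  # prints [None, None, None, 3, 3, 3, 6, 6, 6]
```

The result matches the last_value_prev_month column for id 1 in the output above.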
Yes, we can do it using window functions to avoid a join.
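from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the input DataFrame with columns date, id, value.
w = wd.partitionBy('id').orderBy('date')

# 'mth' is truncated to the month start rather than func.month('date'), so the
# same month number in different years is not treated as the same month.
data_sdf. \
    withColumn('mth', func.date_trunc('month', 'date')). \
    withColumn('blah',
               (func.col('mth') != func.lag('mth').over(w)).cast('int')
               ). \
    withColumn('blah2',
               func.when(func.col('blah') == 1, func.lag('value').over(w))
               ). \
    withColumn('last_value_prev_month',
               func.last('blah2', ignorenulls=True).over(w)
               ). \
    drop('mth', 'blah', 'blah2'). \
    show()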
# +----------+---+-----+---------------------+
# |      date| id|value|last_value_prev_month|
# +----------+---+-----+---------------------+
# |2022-07-29|  1|    1|                 null|
# |2022-07-30|  1|    2|                 null|
# |2022-07-31|  1|    3|                 null|
# |2022-08-01|  1|    4|                    3|
# |2022-08-02|  1|    5|                    3|
# |2022-08-03|  1|    6|                    3|
# |2022-09-10|  1|    8|                    6|
# |2022-09-11|  1|    9|                    6|
# |2022-09-12|  1|   10|                    6|
# |2022-07-29|  2|    7|                 null|
# |2022-07-30|  2|    6|                 null|
# |2022-07-31|  2|    5|                 null|
# |2022-08-01|  2|    4|                    5|
# |2022-08-02|  2|    3|                    5|
# |2022-08-03|  2|    2|                    5|
# |2022-09-10|  2|    8|                    2|
# |2022-09-11|  2|    9|                    2|
# |2022-09-12|  2|   10|                    2|
# +----------+---+-----+---------------------+
blah flags the record of the first date in a month (per id).
blah2 sets the lag of value for the aforementioned record, i.e. the value on the last date of the previous month. It is null on every other row.
Finally, use the last() window function with ignorenulls=True on the aforementioned blah2 field to fill the nulls on the remaining rows of the month.
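The flag, lag and fill steps can also be traced in plain Python, here on the rows for id 2. Again this is only a sketch of the reasoning rather than Spark code: the function name last_value_prev_month_ffill and the rows_id2 list are made up for the illustration, and the rows are assumed to be sorted by date.

```python
from datetime import date

# Rows for id 2 from the question, already sorted by date: (date, value).
rows_id2 = [
    (date(2022, 7, 29), 7), (date(2022, 7, 30), 6), (date(2022, 7, 31), 5),
    (date(2022, 8, 1), 4), (date(2022, 8, 2), 3), (date(2022, 8, 3), 2),
    (date(2022, 9, 10), 8), (date(2022, 9, 11), 9), (date(2022, 9, 12), 10),
]


def last_value_prev_month_ffill(rows):
    """Plain-Python analogue of blah / blah2 / last(ignorenulls=True) (hypothetical helper)."""
    out = []
    carried = None  # running last non-null blah2, like last(..., ignorenulls=True)
    prev = None     # the previous row, like lag(...)
    for d, v in rows:
        # blah: the month differs from the previous row's month.
        # The very first row has no previous row, so it is never flagged.
        is_new_month = prev is not None and (prev[0].year, prev[0].month) != (d.year, d.month)
        # blah2: the previous row's value, set only on flagged rows.
        # A null previous value is skipped, as ignorenulls=True would do.
        if is_new_month and prev[1] is not None:
            carried = prev[1]
        out.append(carried)
        prev = (d, v)
    return out


result_id2 = last_value_prev_month_ffill(rows_id2)
print(result_id2)  # prints [None, None, None, 5, 5, 5, 2, 2, 2]
```

The result matches the last_value_prev_month column for id 2 in the output above.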