PySpark: last value of the previous month per group, avoiding a self-join

Question

I would like to obtain the last value an attribute takes per group over the previous month.

I can achieve this with a self-join like so:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

# Reuses the active session if one already exists (e.g. in a notebook).
spark = SparkSession.builder.getOrCreate()

df = (
    spark.createDataFrame(
        [
            ("2022-07-29", 1, 1),
            ("2022-07-30", 1, 2),
            ("2022-07-31", 1, 3),
            ("2022-08-01", 1, 4),
            ("2022-08-02", 1, 5),
            ("2022-08-03", 1, 6),
            ("2022-09-10", 1, 8),
            ("2022-09-11", 1, 9),
            ("2022-09-12", 1, 10),
            ("2022-07-29", 2, 7),
            ("2022-07-30", 2, 6),
            ("2022-07-31", 2, 5),
            ("2022-08-01", 2, 4),
            ("2022-08-02", 2, 3),
            ("2022-08-03", 2, 2),
            ("2022-09-10", 2, 8),
            ("2022-09-11", 2, 9),
            ("2022-09-12", 2, 10),
        ],
        ["date", "id", "value"],
    )
    .withColumn("date", F.to_date(F.col("date")))
)

# Newest date first, so first("value") is the last value of each (id, month).
w = Window.partitionBy("id", "month").orderBy(F.col("date").desc())
df = (
    df
    .withColumn("month", F.date_trunc("month", F.col("date")))
    .join(
        df
        # Shift the month key forward by one so that each month's last value
        # lines up with the rows of the following month.
        .withColumn("month", F.add_months(F.date_trunc("month", F.col("date")), 1))
        .withColumn("last_value_prev_month", F.first(F.col("value")).over(w))
        .select("id", "month", "last_value_prev_month")
        .drop_duplicates(subset=["id", "month"]),
        on=["id", "month"],
        how="left"
    )
    .drop("month")
    .orderBy(["id", "date"])
)
df.show()

+---+----------+-----+---------------------+
| id|      date|value|last_value_prev_month|
+---+----------+-----+---------------------+
|  1|2022-07-29|    1|                 null|
|  1|2022-07-30|    2|                 null|
|  1|2022-07-31|    3|                 null|
|  1|2022-08-01|    4|                    3|
|  1|2022-08-02|    5|                    3|
|  1|2022-08-03|    6|                    3|
|  1|2022-09-10|    8|                    6|
|  1|2022-09-11|    9|                    6|
|  1|2022-09-12|   10|                    6|
|  2|2022-07-29|    7|                 null|
|  2|2022-07-30|    6|                 null|
|  2|2022-07-31|    5|                 null|
|  2|2022-08-01|    4|                    5|
|  2|2022-08-02|    3|                    5|
|  2|2022-08-03|    2|                    5|
|  2|2022-09-10|    8|                    2|
|  2|2022-09-11|    9|                    2|
|  2|2022-09-12|   10|                    2|
+---+----------+-----+---------------------+

This seems inefficient to me.

Can this be done with just a window, avoiding a self-join?

Answer 1

samkart has provided the main idea in the other answer. Here I provide a solution that uses two windows instead of three.

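# df is the original three-column DataFrame (date, id, value) from the question.
days = lambda x: x * 86400  # days -> seconds, to match the epoch-second ordering

# Per id: every row strictly before the current date (at least one day earlier).
w1 = (
    Window
    .partitionBy("id")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(Window.unboundedPreceding, -days(1))
)
# Per id and calendar month, earliest date first.
w2 = (
    Window
    .partitionBy("id", F.date_trunc("month", "date"))
    .orderBy(F.col("date"))
)

(
    df
    .withColumn("value_prev_day", F.last("value").over(w1))
    .withColumn("last_value_prev_month", F.first("value_prev_day").over(w2))
    .orderBy(["id", "date"])
    .show()
)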

+----------+---+-----+--------------+---------------------+
|      date| id|value|value_prev_day|last_value_prev_month|
+----------+---+-----+--------------+---------------------+
|2022-07-29|  1|    1|          null|                 null|
|2022-07-30|  1|    2|             1|                 null|
|2022-07-31|  1|    3|             2|                 null|
|2022-08-01|  1|    4|             3|                    3|
|2022-08-02|  1|    5|             4|                    3|
|2022-08-03|  1|    6|             5|                    3|
|2022-09-10|  1|    8|             6|                    6|
|2022-09-11|  1|    9|             8|                    6|
|2022-09-12|  1|   10|             9|                    6|
|2022-07-29|  2|    7|          null|                 null|
|2022-07-30|  2|    6|             7|                 null|
|2022-07-31|  2|    5|             6|                 null|
|2022-08-01|  2|    4|             5|                    5|
|2022-08-02|  2|    3|             4|                    5|
|2022-08-03|  2|    2|             3|                    5|
|2022-09-10|  2|    8|             2|                    2|
|2022-09-11|  2|    9|             8|                    2|
|2022-09-12|  2|   10|             9|                    2|
+----------+---+-----+--------------+---------------------+

value_prev_day is the most recent value strictly before the current row's date (per id). It is the previous day's value only when that day is present in the data; for 2022-09-10 it is the value from 2022-08-03.
Once we have this, we partition the data again, this time by id and the month of the current row's date, and order each partition by date, so the earliest date present in the month is the first row of its partition. We assign last_value_prev_month as first(value_prev_day) over this partition. This is the last value seen before the month began, since it is the value_prev_day of the month's earliest row.
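
The two window steps can also be traced without Spark. The following is a minimal plain-Python sketch of the same logic, not the Spark implementation: the function name trace_two_windows is hypothetical, and it assumes rows are (date, id, value) tuples with at most one row per id and date.

```python
from datetime import date
from itertools import groupby

# Sample rows for id 1, copied from the question.
rows = [
    (date(2022, 7, 29), 1, 1),
    (date(2022, 7, 30), 1, 2),
    (date(2022, 7, 31), 1, 3),
    (date(2022, 8, 1), 1, 4),
    (date(2022, 8, 2), 1, 5),
    (date(2022, 8, 3), 1, 6),
    (date(2022, 9, 10), 1, 8),
    (date(2022, 9, 11), 1, 9),
    (date(2022, 9, 12), 1, 10),
]


def trace_two_windows(rows):
    """Return (date, id, value, value_prev_day, last_value_prev_month) tuples."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # by id, then date
    for _id, group in groupby(ordered, key=lambda r: r[1]):
        prev_value = None       # step 1 (w1): latest value before the current date
        current_month = None
        first_in_month = None   # step 2 (w2): value_prev_day of the month's earliest row
        for d, i, v in group:
            value_prev_day = prev_value
            if (d.year, d.month) != current_month:
                # Earliest row of a new month: its value_prev_day is the
                # last value seen before the month began.
                current_month = (d.year, d.month)
                first_in_month = value_prev_day
            out.append((d, i, v, value_prev_day, first_in_month))
            prev_value = v
    return out


result = trace_two_windows(rows)
```

One design point this makes visible: like the window version, the sketch carries forward the latest earlier value even when the calendar month immediately before has no rows at all, whereas the self-join in the question would return null for that month.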

Answer 2

Yes, we can do it using window functions to avoid a join.

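from pyspark.sql import functions as func, Window as wd

# data_sdf is the three-column DataFrame (date, id, value) from the question.
# Note: func.month() compares the month number only, so a jump between rows
# that lands on the same month number of a later year would not be flagged.
data_sdf. \
    withColumn('mth', func.month('date')). \
    withColumn('blah', 
               (func.col('mth') != func.lag('mth').over(wd.partitionBy('id').orderBy('date'))).cast('int')
               ). \
    withColumn('blah2', 
               func.when(func.col('blah') == 1, 
                         func.lag('value').over(wd.partitionBy('id').orderBy('date'))
                         )
               ). \
    withColumn('last_value_prev_month', 
               func.last('blah2', ignorenulls=True).over(wd.partitionBy('id').orderBy('date'))
               ). \
    drop('mth', 'blah', 'blah2'). \
    show()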

# +----------+---+-----+---------------------+
# |      date| id|value|last_value_prev_month|
# +----------+---+-----+---------------------+
# |2022-07-29|  1|    1|                 null|
# |2022-07-30|  1|    2|                 null|
# |2022-07-31|  1|    3|                 null|
# |2022-08-01|  1|    4|                    3|
# |2022-08-02|  1|    5|                    3|
# |2022-08-03|  1|    6|                    3|
# |2022-09-10|  1|    8|                    6|
# |2022-09-11|  1|    9|                    6|
# |2022-09-12|  1|   10|                    6|
# |2022-07-29|  2|    7|                 null|
# |2022-07-30|  2|    6|                 null|
# |2022-07-31|  2|    5|                 null|
# |2022-08-01|  2|    4|                    5|
# |2022-08-02|  2|    3|                    5|
# |2022-08-03|  2|    2|                    5|
# |2022-09-10|  2|    8|                    2|
# |2022-09-11|  2|    9|                    2|
# |2022-09-12|  2|   10|                    2|
# +----------+---+-----+---------------------+

blah flags the record with the first date in each month.
blah2 holds the lag of value for that flagged record, i.e. the value on the last date of the previous month, and is null everywhere else.
last() with ignorenulls=True over the blah2 field then fills the nulls with the most recent non-null value.
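
The three steps can be followed in a minimal plain-Python sketch. It illustrates the flag, lag and fill logic and is not the Spark code itself: the function name fill_last_value_prev_month is hypothetical, and it assumes the rows belong to a single id and are already sorted by date.

```python
from datetime import date

# (date, value) rows for id 2, copied from the question, sorted by date.
rows = [
    (date(2022, 7, 29), 7),
    (date(2022, 7, 30), 6),
    (date(2022, 7, 31), 5),
    (date(2022, 8, 1), 4),
    (date(2022, 8, 2), 3),
    (date(2022, 8, 3), 2),
    (date(2022, 9, 10), 8),
    (date(2022, 9, 11), 9),
    (date(2022, 9, 12), 10),
]


def fill_last_value_prev_month(rows):
    """Return (date, value, last_value_prev_month) tuples for one id."""
    result = []
    prev_month = None   # lag('mth')
    prev_value = None   # lag('value')
    carried = None      # running last(blah2, ignorenulls=True)
    for d, v in rows:
        # blah: the month differs from the previous row's month.
        # The very first row has no previous row, so it is never flagged.
        is_month_start = prev_month is not None and d.month != prev_month
        # blah2: the lagged value, kept only on flagged rows.
        lagged = prev_value if is_month_start else None
        # last(..., ignorenulls=True): keep the most recent non-null blah2.
        if lagged is not None:
            carried = lagged
        result.append((d, v, carried))
        prev_month, prev_value = d.month, v
    return result


filled = [c for _, _, c in fill_last_value_prev_month(rows)]
# filled == [None, None, None, 5, 5, 5, 2, 2, 2]
```

The result matches the last_value_prev_month column for id 2 in the output above.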
