Updating a Spark DataFrame row by row

Question

Consider the following table/DataFrame:

|------------------|
|date       | value|
|------------------|
|2022-01-08 | 2    |
|2022-01-09 | 4    |
|2022-01-10 | 6    |
|2022-01-11 | 8    |
|------------------|

and the following SQL query:

WHILE (@start_date <= @end_date)
BEGIN
    update t1 set value = 
        IIF(ISNULL(avg_value,0) < 2, 0,1)
    from #table t1
    outer apply (
        select 
            top 1 value as avg_value
        FROM 
            #table t2
        WHERE
            value >= 2 AND
            t2.date < t1.date
        ORDER BY date DESC
    ) t3
    where t1.date = @start_date
    SET @start_date = dateadd(day,1, @start_date)
END

I know that my output is:

|------------------------------|
|date       | value | avg_value|
|------------------------------|
|2022-01-08 | 0     | null     |
|2022-01-09 | 0     | 0        |
|2022-01-10 | 0     | 0        |
|2022-01-11 | 0     | 0        |
|------------------------------|

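The query runs the outer apply once for each date, so the table is updated row by row. It is worth noting that the outer apply retrieves the already-updated values.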
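To make the row-by-row dependency concrete, here is a minimal plain-Python sketch of the loop. It is only a model of the T-SQL semantics, not Spark code; the function name `run_sequential_update` and the dict-based table are assumptions made for illustration.

```python
from datetime import date, timedelta

def run_sequential_update(rows):
    """Emulate the WHILE loop: one UPDATE per date, each reading the current table state."""
    table = dict(rows)                     # date -> value, mutated in place like #table
    current, end = min(table), max(table)  # stand-ins for @start_date and @end_date
    while current <= end:
        if current in table:
            # OUTER APPLY: the latest earlier row whose current value is >= 2
            earlier = [d for d, v in table.items()
                       if d < current and v is not None and v >= 2]
            avg_value = table[max(earlier)] if earlier else None
            # IIF(ISNULL(avg_value, 0) < 2, 0, 1)
            table[current] = 0 if (avg_value or 0) < 2 else 1
        current += timedelta(days=1)
    return table

rows = {date(2022, 1, 8): 2, date(2022, 1, 9): 4,
        date(2022, 1, 10): 6, date(2022, 1, 11): 8}
print(list(run_sequential_update(rows).values()))  # [0, 0, 0, 0]
```

Because the first row is overwritten with 0 before the second iteration runs, no later lookup ever finds an earlier row with a value of at least 2, so every row ends up as 0.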

In Spark, I used a Window function to get the value from the outer apply and stored it in a helper column:

|-------------------------------|
|date       | value | avg_value |
|-------------------------------|
|2022-01-08 | 0     | null      |
|2022-01-09 | 4     | 2         |
|2022-01-10 | 6     | 4         |
|2022-01-11 | 8     | 6         |
|-------------------------------|

Then I used withColumn to perform the update on the value column, and my output is:

|-------------------|
|date       | value |
|-------------------|
|2022-01-08 | 0     |
|2022-01-09 | 1     |
|2022-01-10 | 1     |
|2022-01-11 | 1     |
|-------------------|

I know my Spark output differs from the SQL output: SQL performs the update in each iteration, whereas in Spark I apply the update only after all the avg_value values have been calculated.
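The batch behaviour described here can be modelled in plain Python as well. This is a hedged sketch rather than the actual Spark job: `run_batch_update` is a hypothetical name, and the list of `(date, value)` tuples stands in for a DataFrame already ordered by date.

```python
def run_batch_update(rows):
    """rows: list of (date, value) tuples sorted by date."""
    # Step 1: compute avg_value for every row from the ORIGINAL values
    # (the window-function step), before anything is updated.
    avg_values = []
    for i in range(len(rows)):
        earlier = [v for _, v in rows[:i] if v is not None and v >= 2]
        avg_values.append(earlier[-1] if earlier else None)
    # Step 2: apply IIF(ISNULL(avg_value, 0) < 2, 0, 1) to all rows at once
    # (the withColumn step).
    return [(d, 0 if (avg or 0) < 2 else 1)
            for (d, _), avg in zip(rows, avg_values)]

rows = [('2022-01-08', 2), ('2022-01-09', 4),
        ('2022-01-10', 6), ('2022-01-11', 8)]
print([flag for _, flag in run_batch_update(rows)])  # [0, 1, 1, 1]
```

Every lookup reads the original snapshot, so rows after the first still see earlier values of 2 or more. That is the source of the difference from the sequential result.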

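My question is:

Is there a way to run this query without a while loop or, more specifically, is there a way to perform a row-by-row update in Spark?

My original DF has about 300K rows, and I am avoiding loops for performance reasons.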

Answer 1

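You say you have 300K rows. I doubt they all contain different dates, so I assume you have some groups. Below is the example dataframe I will use. I have intentionally added groups covering different cases: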

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [(1, '2022-01-08', 2),    # 0
     (1, '2022-01-09', 4),    # 1
     (1, '2022-01-10', 6),    # 1
     (1, '2022-01-11', 8),    # 1

     (2, '2022-01-08', 0),    # 0
     (2, '2022-01-09', 2),    # 0
     (2, '2022-01-10', 6),    # 1

     (3, '2022-01-08', 4),    # 0
     (3, '2022-01-09', 6),    # 1
     (3, '2022-01-10', 8),    # 1

     (4, '2022-01-08', 0),    # 0
     (4, '2022-01-09', 6),    # 1
     (4, '2022-01-10', None), # 0
     (4, '2022-01-11', 6)],   # 1
    ['id', 'date', 'value'])

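In the code comments, I have provided the expected results.

What I am trying to show is that Spark is not designed for loops. Almost any logic can be rewritten so that it does not use a loop at all.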

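Window function approach

The logic of the script you provided can be rewritten to do the same thing with a simpler algorithm and no loop: a window function and a conditional statement.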

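# Order the rows of each id by date
w = W.partitionBy('id').orderBy('date')
df.withColumn(
    'value',
    # 1 for any row that is not the first of its group and whose value exceeds 2, else 0
    F.when((F.row_number().over(w) != 1) & (F.col('value') > 2), 1).otherwise(0)
).show()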
# +---+----------+-----+
# |id |date      |value|
# +---+----------+-----+
# |1  |2022-01-08|0    |
# |1  |2022-01-09|1    |
# |1  |2022-01-10|1    |
# |1  |2022-01-11|1    |
# |2  |2022-01-08|0    |
# |2  |2022-01-09|0    |
# |2  |2022-01-10|1    |
# |3  |2022-01-08|0    |
# |3  |2022-01-09|1    |
# |3  |2022-01-10|1    |
# |4  |2022-01-08|0    |
# |4  |2022-01-09|1    |
# |4  |2022-01-10|0    |
# |4  |2022-01-11|1    |
# +---+----------+-----+

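"Looping" in the higher-order function aggregate

The aggregate function takes an array, "loops" through every element and returns a single value (here, that value is itself an array).

The lambda function performs array_union, which unions arrays that have the same schema.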

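# Collect each id's rows into one array of (date, value) structs, sorted by date
df = df.groupBy('id').agg(F.array_sort(F.collect_list(F.struct('date', 'value'))).alias('a'))
df = df.withColumn(
    'a',
    F.slice(
        F.aggregate(
            'a',
            # Seed accumulator: a dummy struct whose null date marks "no previous row"
            F.expr("array(struct(cast(null as string) date, 0 value))"),
            # Append each element with its value replaced by the 0/1 flag
            lambda acc, x: F.array_union(
                acc,
                F.array(x.withField(
                    'value',
                    F.when(F.element_at(acc, -1)['date'].isNotNull() & (x['value'] > 2), 1).otherwise(0)
                ))
            )
        ),
        2, F.size('a')  # drop the seed element
    )
)
# Expand the array of structs back into date and value columns
df = df.selectExpr("id", "inline(a)")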

df.show()
# +---+----------+-----+
# | id|      date|value|
# +---+----------+-----+
# |  1|2022-01-08|    0|
# |  1|2022-01-09|    1|
# |  1|2022-01-10|    1|
# |  1|2022-01-11|    1|
# |  2|2022-01-08|    0|
# |  2|2022-01-09|    0|
# |  2|2022-01-10|    1|
# |  3|2022-01-08|    0|
# |  3|2022-01-09|    1|
# |  3|2022-01-10|    1|
# |  4|2022-01-08|    0|
# |  4|2022-01-09|    1|
# |  4|2022-01-10|    0|
# |  4|2022-01-11|    1|
# +---+----------+-----+

This way, you can "loop" through the elements of an array. But be mindful of the array sizes, because each array is held on a single cluster node.
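For readers who find the nested expression hard to follow, the fold over one group's array can be pictured with `functools.reduce` in plain Python. This is only an analogy under stated assumptions, not Spark code: `fold_group` is a hypothetical name, tuples stand in for structs, and list concatenation stands in for array_union (which would also drop duplicate elements, something that does not occur here because dates within a group are distinct).

```python
from functools import reduce

def fold_group(rows):
    """rows: list of (date, value) tuples for a single id, in any order."""
    ordered = sorted(rows)  # plays the role of array_sort on the structs
    seed = [(None, 0)]      # plays the role of array(struct(null date, 0 value))

    def step(acc, x):
        row_date, value = x
        prev_date = acc[-1][0]  # element_at(acc, -1)['date']
        flag = 1 if prev_date is not None and value is not None and value > 2 else 0
        return acc + [(row_date, flag)]

    # Drop the seed element, as F.slice(..., 2, size) does
    return reduce(step, ordered, seed)[1:]

group_4 = [('2022-01-08', 0), ('2022-01-09', 6),
           ('2022-01-10', None), ('2022-01-11', 6)]
print([flag for _, flag in fold_group(group_4)])  # [0, 1, 0, 1]
```

The accumulator carries everything processed so far, which is what lets each step inspect the previous element without an explicit loop over rows.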
