
OOM using Spark window function with 30 days interval

I have this data frame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark
    .createDataFrame([
        [20210101, 'A', 103, "abc"],
        [20210101, 'A', 102, "def"],
        [20210101, 'A', 101, "def"],
        [20210102, 'A', 34, "ghu"],
        [20210101, 'B', 180, "xyz"],
        [20210102, 'B', 123, "kqt"]
    ])
    .toDF("txn_date", "txn_type", "txn_amount", "other_attributes")
)

Each date has multiple transactions of each of the different types. My task is to compute the standard deviation of the amount for each record (for the same type, going back 30 days).

The most obvious approach (which I tried) is to create a window based on type that includes records going back 30 days.

from pyspark.sql import Window, functions as F
from pyspark.sql.types import LongType

days = lambda i: i * 86400
win = Window.partitionBy("txn_type").orderBy(F.col("txn_date").cast(LongType())).rangeBetween(-days(30), 0)
df = df.withColumn("stddev_last_30days", F.stddev(F.col("txn_amount")).over(win))

Since some of the transaction types have millions of transactions per day, this runs into OOM.

I tried doing it in parts (taking only a few records for each date at a time), but this leads to error-prone calculations since standard deviation is not additive.
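As an aside: while standard deviation itself is not additive, it can be derived from statistics that are: the count, the sum, and the sum of squares. A minimal pure-Python sketch (my own illustration, not code from the question) of merging partial aggregates:

```python
import math

def partial_stats(xs):
    """Sufficient statistics for one chunk: count, sum, sum of squares."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def merge(a, b):
    """These statistics are additive, so chunks can be merged freely."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def sample_stddev(stats):
    """Sample standard deviation (the formula Spark's stddev uses)."""
    n, s, ss = stats
    if n < 2:
        return None
    return math.sqrt((ss - s * s / n) / (n - 1))

chunk1 = [103, 102]
chunk2 = [101]
merged = merge(partial_stats(chunk1), partial_stats(chunk2))
print(sample_stddev(merged))  # 1.0 - same as stddev over [103, 102, 101]
```

In Spark terms, one could pre-aggregate these three numbers per (txn_type, txn_date) and run the 30-day window over the much smaller daily aggregates, combining them with the same formula.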

I also tried 'collect_set' over all records for a transaction type and date (so all amounts come in as an array in one column), but this runs into OOM as well.

I tried processing one month at a time (I need a minimum of 2 months of data since I need to go back 1 month), but even that overwhelms my executors.

What would be a scalable way to solve this problem?

Notes:

  • In the original data, column txn_date is stored as a long in "yyyyMMdd" format.

  • There are other columns in the data frame that may or may not be the same for each date and type. I haven't included them in the sample code for simplicity.

Filtering

It's always good to remove data which is not needed. You said you need just the last 60 days, so you could filter out what's not needed.
This line would keep only rows whose date is not older than the last 60 days (up until today):

df = df.filter(F.to_date('txn_date', 'yyyyMMdd').between(F.current_date()-61, F.current_date()))
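As a sanity check on the cutoff logic, here's a pure-Python sketch (with a pretend "today" and a hypothetical parse_yyyymmdd helper, neither from the original answer):

```python
from datetime import date, timedelta

def parse_yyyymmdd(n):
    """Hypothetical helper: turn a yyyyMMdd long into a date object."""
    return date(n // 10000, n // 100 % 100, n % 100)

rows = [20210101, 20210102, 19990101]
today = date(2021, 1, 2)                 # pretend "today" for the example
cutoff = today - timedelta(days=61)      # mirrors F.current_date() - 61
kept = [n for n in rows if cutoff <= parse_yyyymmdd(n) <= today]
print(kept)  # [20210101, 20210102] - the 1999 row is dropped
```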

I'll not use it for now, in order to illustrate other issues.

Window

The first, simple thing: since txn_date is already a long, you don't need to cast it to long again, so we can remove .cast(LongType()).

The other, big thing is that your window's lower bound is wrong. Look, let's add one more row to the input:

[19990101, 'B', 9999999, "xxxxxxx"],

The row represents a date from the year 1999. After adding it and running the code, we get this:

# +--------+--------+----------+----------------+------------------+
# |txn_date|txn_type|txn_amount|other_attributes|stddev_last_30days|
# +--------+--------+----------+----------------+------------------+
# |20210101|       A|       103|             abc|               1.0|
# |20210101|       A|       102|             def|               1.0|
# |20210101|       A|       101|             def|               1.0|
# |20210102|       A|        34|             ghu|34.009802508492555|
# |19990101|       B|   9999999|         xxxxxxx|              null|
# |20210101|       B|       180|             xyz|  7070939.82553808|
# |20210102|       B|       123|             kqt|  5773414.64605055|
# +--------+--------+----------+----------------+------------------+

You can see that the stddev for the 2021 rows was also affected, so the 30-day window does not work; your window actually takes all the data it can. We can check what the lower bound is for date 20210101:

print(20210101-days(30))  # Returns 17618101 - I doubt you wanted this date as lower bound

Probably this was your biggest problem. You should never try to outsmart dates and times. Always use functions specialized for dates and times.

You can use this window:

days = lambda i: i * 86400
w = Window.partitionBy('txn_type').orderBy(F.unix_timestamp(F.col('txn_date').cast('string'), 'yyyyMMdd')).rangeBetween(-days(30), 0)
df = df.withColumn('stddev_last_30days', F.stddev('txn_amount').over(w))

df.show()
# +--------+--------+----------+----------------+------------------+
# |txn_date|txn_type|txn_amount|other_attributes|stddev_last_30days|
# +--------+--------+----------+----------------+------------------+
# |20210101|       A|       103|             abc|               1.0|
# |20210101|       A|       102|             def|               1.0|
# |20210101|       A|       101|             def|               1.0|
# |20210102|       A|        34|             ghu|34.009802508492555|
# |19990101|       B|   9999999|         xxxxxxx|              null|
# |20210101|       B|       180|             xyz|              null|
# |20210102|       B|       123|             kqt| 40.30508652763321|
# +--------+--------+----------+----------------+------------------+

unix_timestamp can transform your 'yyyyMMdd' format into a proper long-format number (UNIX time in seconds). From this, you can now subtract seconds (30 days' worth of seconds).
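To illustrate the difference between the two number lines, here's a small pure-Python sketch (standard library only, assuming UTC; Spark's unix_timestamp uses the session time zone, so the exact seconds could differ around DST changes):

```python
from datetime import datetime, timezone

days = lambda i: i * 86400

def to_unix(yyyymmdd):
    """Parse a yyyyMMdd-style long into UNIX seconds (UTC assumed here)."""
    return int(datetime.strptime(str(yyyymmdd), "%Y%m%d")
               .replace(tzinfo=timezone.utc).timestamp())

# On the epoch scale, subtracting days(30) lands exactly 30 days back:
print(to_unix(20210101) - days(30) == to_unix(20201202))  # True
# Whereas plain long arithmetic on yyyyMMdd produces a meaningless number:
print(20210101 - days(30))  # 17618101 - not a valid date at all
```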
