Iterating over PySpark DataFrame rows and applying a UDF

Question

I have a DataFrame that looks like this:

+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A            | 1        | 201      |
| A            | 2        | 645      |
| A            | 3        | 302      |
| B            | 1        | 335      |
| B            | 2        | 834      |
+--------------+----------+----------+

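I want to group by partitionCol, then iterate over the rows within each partition in orderCol order, applying a function that computes a new column from valueCol and a cached value. For example: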

def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp

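I know I need to group by partitionCol and apply a UDF that operates on each chunk separately, but I am struggling to find a good way to iterate over the rows and apply the logic I described so that I get the desired output: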

+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A            | 1        | 201      | C1            |
| A            | 2        | 645      | C1            |
| A            | 3        | 302      | C2            |
| B            | 1        | 335      | C1            |
| B            | 2        | 834      | C2            |
+--------------+----------+----------+---------------+

Answer 1

I think the best approach is to apply a UDF to the whole dataset:

from pyspark.sql import functions as F

# First, create a struct from the order column and the value column
df = df.withColumn("my_data", F.struct(F.col("orderCol"), F.col("valueCol")))

# Then collect those structs into one array per partition
df = df.groupBy("partitionCol").agg(F.collect_list("my_data").alias("my_data"))

# Finally, apply your function to that array (my_udf is your own UDF, not defined here)
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))

But without knowing what you actually want to do, that is all I can offer.
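For illustration only, here is a rough sketch of what such a UDF might look like. The labelling rule is a made-up placeholder ("C1" when the value sets a new running maximum within the partition, otherwise "C2"), since the question does not say what the real condition is, so it is not meant to reproduce the desired output above. The names label_group and add_calculated_col are hypothetical, and the sketch assumes orderCol and valueCol hold integers. Because collect_list does not guarantee element order, the helper sorts by orderCol itself before walking the rows.

```python
from typing import List, Sequence, Tuple


def label_group(rows: Sequence[Tuple[int, int]]) -> List[Tuple[int, int, str]]:
    """Label one partition's (orderCol, valueCol) pairs, visiting them in orderCol order."""
    cached_value = None  # state carried from row to row within one partition
    labelled = []
    for order, value in sorted(rows, key=lambda r: r[0]):
        # Placeholder rule (an assumption, not from the question):
        # "C1" when the value sets a new running maximum, otherwise "C2".
        if cached_value is None or value > cached_value:
            label = "C1"
            cached_value = value
        else:
            label = "C2"
        labelled.append((order, value, label))
    return labelled


def add_calculated_col(df):
    """Wire label_group into Spark: collect each partition, label it, explode back to rows."""
    # Imported here so label_group can be tried out without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    result_type = T.ArrayType(
        T.StructType([
            T.StructField("orderCol", T.LongType()),
            T.StructField("valueCol", T.LongType()),
            T.StructField("calculatedCol", T.StringType()),
        ])
    )
    label_udf = F.udf(
        lambda rows: label_group([(r["orderCol"], r["valueCol"]) for r in rows]),
        result_type,
    )
    return (
        df.groupBy("partitionCol")
        .agg(F.collect_list(F.struct("orderCol", "valueCol")).alias("my_data"))
        .withColumn("labelled", label_udf(F.col("my_data")))
        .withColumn("labelled", F.explode("labelled"))
        .select("partitionCol", "labelled.*")
    )
```

Keeping the row-by-row logic in a plain Python function means it can be checked on ordinary lists first, for example label_group([(1, 201), (2, 645), (3, 302)]), before it is run through Spark. Note that each partition is collected into a single array, so this only suits partitions small enough to fit in one executor's memory.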

Answer 1 · 0 · accepted · 2019-06-25 09:00:32
