PySpark: How to apply a Python UDF to PySpark DataFrame columns?
Iterate pyspark dataframe rows and apply UDF
I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then within each partition iterate over the rows, sorted by orderCol, and apply some function that computes a new column from valueCol and a cached value. For example:
def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp
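To make the pseudocode above concrete, here is a hedged pure-Python sketch of what such a stateful function could look like. The comparison rule and the `"C1"`/`"C2"` labels are assumptions chosen for illustration only, since the question deliberately leaves the condition and the cache update unspecified:

```python
# Hypothetical implementation of foo. The condition (compare against the
# previous row's value) and the cache-update rule are assumptions; the
# question leaves both unspecified.
def foo(col_value, cache):
    # Label the row by comparing it with the cached (previous) value.
    tmp = "C1" if col_value >= cache["prev"] else "C2"
    # Update the cache so the next row sees this row's value.
    cache["prev"] = col_value
    return tmp

# Rows of partition A, already sorted by orderCol.
cache = {"prev": 0}
labels = [foo(v, cache) for v in [201, 645, 302]]
# labels == ["C1", "C1", "C2"]
```

A mutable dict is used as the cache here because a plain integer argument could not be updated across calls; inside a Spark UDF the same state would simply live in a local variable of the enclosing function.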
I know I need to group by partitionCol and apply a UDF that operates on each chunk separately, but I'm struggling to find a good way to iterate over the rows and apply the logic I described, so as to get the desired output:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
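The whole group-sort-iterate computation can be prototyped in plain Python before moving it into Spark. The sketch below, using stdlib `itertools.groupby`, follows that shape; the labelling rule (start a new `C<n>` segment whenever the value drops below the cached previous value) is a placeholder assumption, not the rule from the question:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("A", 1, 201), ("A", 2, 645), ("A", 3, 302),
    ("B", 1, 335), ("B", 2, 834),
]

def label_partition(values):
    # Placeholder rule: start at "C1" and bump the segment number
    # whenever a value drops below the cached previous value.
    out, cached, seg = [], None, 1
    for v in values:
        if cached is not None and v < cached:
            seg += 1
        out.append(f"C{seg}")
        cached = v
    return out

# Sort by (partitionCol, orderCol), then process each partition in order.
rows.sort(key=itemgetter(0, 1))
result = []
for part, grp in groupby(rows, key=itemgetter(0)):
    grp = list(grp)
    labels = label_partition([r[2] for r in grp])
    result.extend([(*r, lab) for r, lab in zip(grp, labels)])
```

With this particular placeholder rule, partition A yields `C1, C1, C2` as in the desired output, while partition B yields `C1, C1`, which shows that the real rule behind the question's expected `calculatedCol` must be something else.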
I think the best approach is to apply a UDF over the whole dataset:
from pyspark.sql import functions as F

# First, create a struct holding the order col and the value col
df = df.withColumn("my_data", F.struct(F.col("orderCol"), F.col("valueCol")))
# Then collect those structs into an array, one per partitionCol group
df = df.groupBy("partitionCol").agg(F.collect_list("my_data").alias("my_data"))
# Finally, apply your function to that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that's all I can offer.
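The function wrapped as `my_udf` would receive the collected array; in Spark each element is a struct (a `Row` with `orderCol` and `valueCol` fields). A pure-Python sketch of such a function follows, with plain tuples standing in for the structs; the labelling rule is again a placeholder assumption:

```python
# Sketch of the function you would wrap with F.udf(..., ArrayType(StringType()))
# and pass as my_udf. Plain (orderCol, valueCol) tuples stand in for the
# Row structs Spark would deliver.
def calculate(my_data):
    # collect_list gives no ordering guarantee, so sort by orderCol first.
    ordered = sorted(my_data, key=lambda r: r[0])
    labels, cached = [], 0
    for order, value in ordered:
        # Placeholder rule: compare each value against the cached previous one.
        labels.append("C1" if value >= cached else "C2")
        cached = value
    return labels

out = calculate([(2, 645), (1, 201), (3, 302)])
# out == ["C1", "C1", "C2"]
```

Note that the result is one label per array element; to get back to one row per original record, as in the desired output table, you would still need to zip the labels with the array and explode, e.g. with `F.posexplode`.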
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost them, please credit this site or the original source. For any questions contact: yoyou2525@163.com.