pyspark variable not defined error using window function in dataframe select operation
I have a sample input dataframe as below, but the number of value columns (the columns starting with m) can vary — there may be n of them.
customer_id|field_id|month_id|m1 |m2
1001 | 10 |01 |10 |20
1002 | 20 |01 |20 |30
1003 | 30 |01 |30 |40
1001 | 10 |02 |40 |50
1002 | 20 |02 |50 |60
1003 | 30 |02 |60 |70
1001 | 10 |03 |70 |80
1002 | 20 |03 |80 |90
1003 | 30 |03 |90 |100
I have to create new columns based on the cumulative sums of m1 and m2, and I used a window function for this. However, I am running into a strange issue, shown below.
Code attempted:
partiton_list = ["customer_id", "field_id"]
# Preparing the window function
window_num = (Window.partitionBy(partiton_list).orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
# Prepare the new columns expression
n1_list_expr = ["F.sum(F.col('m1')).over(window_num).alias('n1')", "F.sum(F.col('m2')).over(window_num).alias('n2')"]
#Evaluated the expression using eval to process column by column in select
new_n1_list_expr = [eval(x) for x in n1_list_expr]
#Getting column list of the source dataframe
df_col = df.columns
# Appending the new columns expression
df_col.append(new_n1_list_expr)
#Doing the select to create/calculate the new columns
df = df.select([x for x in df_col])
But the program fails at the eval statement with the following error:
File "<string>", line 1, in <module>
NameError: name 'window_num' is not defined
I don't understand this: when I run the code standalone it works, but when I run the same block as a common module inside a function, it fails with the above error. Why can't eval find the window through the variable?
Expected output:
customer_id|field_id|month_id|m1 |m2 |n1 |n2
1001 | 10 |01 |10 |20 |10 |20
1002 | 20 |01 |20 |30 |20 |30
1003 | 30 |01 |30 |40 |30 |40
1001 | 10 |02 |40 |50 |50 |70
1002 | 20 |02 |50 |60 |70 |90
1003 | 30 |02 |60 |70 |90 |110
1001 | 10 |03 |70 |80 |120 |150
1002 | 20 |03 |80 |90 |150 |180
1003 | 30 |03 |90 |100 |180 |210
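The NameError is reproducible without Spark at all: `eval()` resolves names against the caller's `globals()` and `locals()`, so a variable that lives in a *different* function's scope (as happens once the code is moved into a common module) is invisible to it. A minimal sketch (the function names are hypothetical, chosen only to mirror the question's structure):

```python
def make_exprs():
    window_num = 42            # local stand-in for the real Window spec
    return ["window_num + 1"]  # expression strings, as in the question

def evaluate(exprs):
    # window_num is not defined in this function's scope nor in module
    # globals, so eval raises: NameError: name 'window_num' is not defined
    return [eval(e) for e in exprs]

def evaluate_with_ns(exprs, namespace):
    # Passing an explicit namespace makes the lookup deterministic.
    return [eval(e, namespace) for e in exprs]

exprs = make_exprs()
try:
    evaluate(exprs)
except NameError as err:
    print(err)                 # name 'window_num' is not defined

print(evaluate_with_ns(exprs, {"window_num": 42}))  # [43]
```

Passing an explicit namespace dict to `eval` fixes the lookup, though building real Column objects instead of expression strings (as in the answer below) avoids `eval` entirely.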
You seem to be asking very similar questions repeatedly.
Here is my attempt. The key point is the `*list` expression, which unpacks the list elements in order as list[0], list[1], .... Your partition_list is a list, so it cannot be passed directly as the partitionBy argument; you should unpack it instead.
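The unpacking behavior can be checked in plain Python with a varargs function that mimics the shape of `Window.partitionBy(*cols)` (the helper here is hypothetical, not Spark's actual implementation):

```python
def partition_by(*cols):
    # Varargs: each column name must arrive as its own positional argument.
    return cols

partiton_list = ["customer_id", "field_id"]

print(partition_by(partiton_list))   # (['customer_id', 'field_id'],)  one list arg
print(partition_by(*partiton_list))  # ('customer_id', 'field_id')     unpacked
```

Depending on the PySpark version, `partitionBy` may also accept a plain list of column names, but explicit unpacking is unambiguous across versions.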
import pyspark.sql.functions as F
from pyspark.sql import Window

# Preparation of the partition cols and the window (the window must be
# defined before it is used in the column expressions)
partiton_list = ["customer_id", "field_id"]
window_num = Window.partitionBy(*partiton_list).orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)

# New cumulative-sum column expressions
n1_list_expr = [F.sum(F.col('m1')).over(window_num).alias('n1'), F.sum(F.col('m2')).over(window_num).alias('n2')]

# Merge the original columns and the new expressions
df_col = df.columns + n1_list_expr
# Result
df.select(*df_col).orderBy('month_id', 'field_id').show(10, False)
+-----------+--------+--------+---+---+---+---+
|customer_id|field_id|month_id|m1 |m2 |n1 |n2 |
+-----------+--------+--------+---+---+---+---+
|1001       |10      |1       |10 |20 |10 |20 |
|1002       |20      |1       |20 |30 |20 |30 |
|1003       |30      |1       |30 |40 |30 |40 |
|1001       |10      |2       |40 |50 |50 |70 |
|1002       |20      |2       |50 |60 |70 |90 |
|1003       |30      |2       |60 |70 |90 |110|
|1001       |10      |3       |70 |80 |120|150|
|1002       |20      |3       |80 |90 |150|180|
|1003       |30      |3       |90 |100|180|210|
+-----------+--------+--------+---+---+---+---+
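Since the question notes that the number of m-columns can be arbitrary, the expression list can also be derived from the dataframe's column names instead of being hand-written. A minimal sketch of that name-derivation step in plain Python (`cumsum_aliases` is a hypothetical helper, not part of any library):

```python
import re

def cumsum_aliases(columns):
    # Pair every measure column (m1, m2, ..., mN) with its target name
    # (n1, n2, ..., nN), so the expression list can be built for any N
    # without eval or hand-written strings.
    return [(c, "n" + m.group(1))
            for c in columns
            if (m := re.fullmatch(r"m(\d+)", c))]

cols = ["customer_id", "field_id", "month_id", "m1", "m2"]
print(cumsum_aliases(cols))  # [('m1', 'n1'), ('m2', 'n2')]
```

In Spark, each `(src, dst)` pair would then feed `F.sum(F.col(src)).over(window_num).alias(dst)` directly, with no eval step at all.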