
pyspark variable not defined error using window function in dataframe select operation


I have a sample input dataframe as below, but the number of value columns (the columns starting with m) can vary.

customer_id|field_id|month_id|m1  |m2
1001       |  10    |01      |10  |20    
1002       |  20    |01      |20  |30    
1003       |  30    |01      |30  |40
1001       |  10    |02      |40  |50    
1002       |  20    |02      |50  |60    
1003       |  30    |02      |60  |70
1001       |  10    |03      |70  |80    
1002       |  20    |03      |80  |90    
1003       |  30    |03      |90  |100

I have to create new columns based on the cumulative sums of m1 and m2, and I have used a window function to do this. However, I am running into a strange issue, as shown below.

Code attempt:

import pyspark.sql.functions as F
from pyspark.sql import Window

partiton_list = ["customer_id", "field_id"]
# Preparing the window function
window_num = (Window.partitionBy(partiton_list).orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
# Preparing the new columns expressions as strings
n1_list_expr = ["F.sum(F.col('m1')).over(window_num).alias('n1')", "F.sum(F.col('m2')).over(window_num).alias('n2')"]
# Evaluating the expressions with eval to process column by column in the select
new_n1_list_expr = [eval(x) for x in n1_list_expr]
# Getting the column list of the source dataframe
df_col = df.columns
# Appending the new columns expressions
df_col.append(new_n1_list_expr)
# Doing the select to create/calculate the new columns
df = df.select([x for x in df_col])

But the program fails at the eval statement with the error below:

File "<string>", line 1, in <module>
NameError: name 'window_num' is not defined

Not sure why: the code works when I try it standalone, but when the same block runs as a common module inside a function, it fails with the above error. I do not understand why eval cannot find the window variable. A minimal sketch of the symptom is shown after this paragraph.
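
For reference, a minimal standalone sketch that appears to reproduce the same symptom, assuming the eval call happens inside a function (the names below are illustrative, not the original code). In Python 3 a list comprehension has its own scope, and eval() with the default namespaces only sees that scope plus the module globals, so a name that is local to the enclosing function is not visible to the evaluated string.

def build_cols():
    window_num = 42                      # local to the function, like the Window spec
    exprs = ["window_num + 1"]
    # eval() here runs inside the comprehension's scope and cannot see window_num
    return [eval(x) for x in exprs]      # NameError: name 'window_num' is not defined

def build_cols_fixed():
    window_num = 42
    exprs = ["window_num + 1"]
    scope = dict(locals())               # capture the function's locals explicitly
    return [eval(x, globals(), scope) for x in exprs]

print(build_cols_fixed())                # [43]

At module level the same comprehension works, because window_num is then a global; that matches the behaviour described above.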

Expected output:

customer_id|field_id|month_id|m1     |m2    |n1   |n2  
1001       |  10    |01      |10     |20    |10   |20  
1002       |  20    |01      |20     |30    |20   |30  
1003       |  30    |01      |30     |40    |30   |40  
1001       |  10    |02      |40     |50    |50   |70  
1002       |  20    |02      |50     |60    |70   |90
1003       |  30    |02      |60     |70    |90   |110  
1001       |  10    |03      |70     |80    |120  |150
1002       |  20    |03      |80     |90    |150  |180
1003       |  30    |03      |90     |100   |180  |210

You seem to be asking very similar questions repeatedly.

Here is my attempt. The key point is that the expression *list unpacks the list elements as list[0], list[1], .... So since your partition_list is a list, it should not be passed to partitionBy as-is; you should unpack it.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Preparation of the partition cols and the window
# (the window must be defined before the column expressions that reference it)
partiton_list = ["customer_id", "field_id"]
window_num = Window.partitionBy(*partiton_list).orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)

# New columns expressions as Column objects (no eval needed)
n1_list_expr = [F.sum(F.col('m1')).over(window_num).alias('n1'), F.sum(F.col('m2')).over(window_num).alias('n2')]

# Merge the original columns and the new expressions
df_col = df.columns + n1_list_expr

# Result
df.select(*df_col).orderBy('month_id', 'field_id').show(10, False)

+-----------+--------+--------+---+---+---+---+
|customer_id|field_id|month_id|m1 |m2 |n1 |n2 |
+-----------+--------+--------+---+---+---+---+
|1001       |10      |1       |10 |20 |10 |20 |
|1002       |20      |1       |20 |30 |20 |30 |
|1003       |30      |1       |30 |40 |30 |40 |
|1001       |10      |2       |40 |50 |50 |70 |
|1002       |20      |2       |50 |60 |70 |90 |
|1003       |30      |2       |60 |70 |90 |110|
|1001       |10      |3       |70 |80 |120|150|
|1002       |20      |3       |80 |90 |150|180|
|1003       |30      |3       |90 |100|180|210|
+-----------+--------+--------+---+---+---+---+
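
Since the number of m columns can vary, the same idea extends to building the expressions programmatically instead of eval-ing strings. A minimal sketch, assuming the value columns are named m1, m2, ... and that df and window_num are defined as above:

import pyspark.sql.functions as F

# Build one cumulative-sum expression per value column, as real Column
# objects, so there is no eval and no string-based name lookup involved.
value_cols = [c for c in df.columns if c.startswith("m")]
new_cols_expr = [
    F.sum(F.col(c)).over(window_num).alias("n" + c[1:])
    for c in value_cols
]

df.select(*df.columns, *new_cols_expr).show(10, False)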
