How to execute custom logic at pyspark window partition
I have a dataframe in the format shown below, with multiple rows per depName. My requirement is to set result = Y at the depName level if either flag_1 or flag_2 is Y in any row of that depName. If both flag_1 and flag_2 are N in every row, result should be N, as shown for depName = personnel.
I am able to get the desired result using joins, but I am curious whether it can be done with window functions, as the dataset is very large.
+---------+------+------+------+
|  depName|flag_1|flag_2|result|
+---------+------+------+------+
|    sales|     N|     Y|     Y|
|    sales|     N|     N|     Y|
|    sales|     N|     N|     Y|
|personnel|     N|     N|     N|
|personnel|     N|     N|     N|
|  develop|     Y|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
+---------+------+------+------+
This answers the original version of the question.
This looks like a case expression:
select t.*,
       (case when flag_1 = 'Y' or flag_2 = 'Y'
             then 'Y' else 'N'
        end) as result
from t;
For the updated version:
select t.*,
       max(case when flag_1 = 'Y' or flag_2 = 'Y'
                then 'Y' else 'N'
           end) over (partition by depname) as result
from t;
If you are using PySpark (since you included it in the tags), and assuming your dataframe is called df, you can use:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('depName')
df = df\
.withColumn('cnt', F.sum(F.when((F.col('flag_1') == 'Y') | (F.col('flag_2') == 'Y'), 1).otherwise(0)).over(w))\
.withColumn('result', F.when(F.col('cnt') >= 1, 'Y').otherwise('N'))
df.show()
+---------+------+------+---+------+
| depName|flag_1|flag_2|cnt|result|
+---------+------+------+---+------+
| develop| Y| N| 1| Y|
| develop| N| N| 1| Y|
| develop| N| N| 1| Y|
| develop| N| N| 1| Y|
| develop| N| N| 1| Y|
|personnel| N| N| 0| N|
|personnel| N| N| 0| N|
| sales| N| Y| 1| Y|
| sales| N| N| 1| Y|
| sales| N| N| 1| Y|
+---------+------+------+---+------+
Basically, within each partition determined by depName, you count how many times the condition flag_1 == 'Y' | flag_2 == 'Y' occurs, and you store that count in cnt for all rows of the partition.
Then, you use a simple .when to mark with 'Y' all groups that have cnt >= 1.