
How to execute custom logic at a PySpark window partition

I have a dataframe in the format shown below, with multiple entries per depName. My requirement is to set result = Y at the depName level if either flag_1 or flag_2 = Y; if both flags, i.e. flag_1 and flag_2, are N, the result is set to N, as shown for depName = personnel.

I am able to get the desired result using joins, but since the dataset is quite large, I am curious whether it can be done using window functions instead.

+---------+------+------+------+
|  depName|flag_1|flag_2|result|
+---------+------+------+------+
|    sales|     N|     Y|     Y|
|    sales|     N|     N|     Y|
|    sales|     N|     N|     Y|
|personnel|     N|     N|     N|
|personnel|     N|     N|     N|
|  develop|     Y|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
+---------+------+------+------+

This answers the original version of the question.

This looks like a case expression:

select t.*,
       (case when flag_1 = 'Y' or flag_2 = 'Y'
             then 'Y' else 'N'
        end) as result
from t;
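Row by row, that case expression reduces to a plain OR over the two flags. A minimal plain-Python sketch of the same per-row logic (the helper name is illustrative):

```python
# Row-wise equivalent of the CASE expression:
# 'Y' if either flag is 'Y', otherwise 'N'.
def row_result(flag_1: str, flag_2: str) -> str:
    return 'Y' if 'Y' in (flag_1, flag_2) else 'N'

print(row_result('N', 'Y'))  # Y
print(row_result('N', 'N'))  # N
```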

For the updated version:

select t.*,
       max(case when flag_1 = 'Y' or flag_2 = 'Y'
                then 'Y' else 'N'
           end) over (partition by depname) as result
from t;
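The window max works because 'Y' sorts after 'N', so taking the max of the per-row case value over a depname partition yields 'Y' whenever any row in the partition has a flag set. A plain-Python sketch of that semantics (the sample rows are illustrative, not the asker's full data):

```python
from collections import defaultdict

rows = [
    ('sales', 'N', 'Y'),
    ('sales', 'N', 'N'),
    ('personnel', 'N', 'N'),
    ('develop', 'Y', 'N'),
    ('develop', 'N', 'N'),
]

# Per-partition max of the row-wise 'Y'/'N' value; since 'Y' > 'N'
# lexicographically, the max is 'Y' iff any row in the group qualifies.
partition_max = defaultdict(lambda: 'N')
for dep, f1, f2 in rows:
    value = 'Y' if f1 == 'Y' or f2 == 'Y' else 'N'
    partition_max[dep] = max(partition_max[dep], value)

# Broadcast the partition result back to every row, like the window does.
result = [(dep, partition_max[dep]) for dep, _, _ in rows]
```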

If you are using PySpark (since you included it in the tags) and your dataframe is called df, you can use:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('depName')

df = df\
  .withColumn('cnt', F.sum(F.when((F.col('flag_1') == 'Y') | (F.col('flag_2') == 'Y'), 1).otherwise(0)).over(w))\
  .withColumn('result', F.when(F.col('cnt') >= 1, 'Y').otherwise('N'))

df.show()

+---------+------+------+---+------+
|  depName|flag_1|flag_2|cnt|result|
+---------+------+------+---+------+
|  develop|     Y|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|personnel|     N|     N|  0|     N|
|personnel|     N|     N|  0|     N|
|    sales|     N|     Y|  1|     Y|
|    sales|     N|     N|  1|     Y|
|    sales|     N|     N|  1|     Y|
+---------+------+------+---+------+

Basically, within each partition determined by depName, you count how many rows satisfy the condition flag_1 == 'Y' | flag_2 == 'Y', and you store that count in cnt for all rows of the partition. Then, a simple .when marks with 'Y' all groups that have cnt >= 1.
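The two-step cnt/result logic above can be sketched in plain Python to show exactly what the window computes (illustrative sample rows, not the full frame):

```python
from collections import Counter

rows = [
    ('develop', 'Y', 'N'), ('develop', 'N', 'N'),
    ('personnel', 'N', 'N'), ('personnel', 'N', 'N'),
    ('sales', 'N', 'Y'), ('sales', 'N', 'N'),
]

# Step 1 (the windowed sum): per-partition count of rows
# where either flag is 'Y'.
cnt = Counter()
for dep, f1, f2 in rows:
    cnt[dep] += 1 if (f1 == 'Y' or f2 == 'Y') else 0

# Step 2 (the .when): every row of a partition with cnt >= 1 gets 'Y'.
results = [(dep, cnt[dep], 'Y' if cnt[dep] >= 1 else 'N')
           for dep, _, _ in rows]
```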
