
How to execute custom logic at a PySpark window partition

I have a dataframe in the format shown below, with multiple entries per depName. My requirement is to set result = Y at the depName level if either flag_1 or flag_2 = Y; if both flags, i.e. flag_1 and flag_2, are N, the result is set to N, as shown for depName = personnel.

I am able to get the desired result using joins, but since the dataset is quite large, I am curious whether it can be done using window functions instead.

+---------+------+------+------+
|  depName|flag_1|flag_2|result|
+---------+------+------+------+
|    sales|     N|     Y|     Y|
|    sales|     N|     N|     Y|
|    sales|     N|     N|     Y|
|personnel|     N|     N|     N|
|personnel|     N|     N|     N|
|  develop|     Y|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
|  develop|     N|     N|     Y|
+---------+------+------+------+

This answers the original version of the question.

This looks like a case expression:

select t.*,
       (case when flag_1 = 'Y' or flag_2 = 'Y'
             then 'Y' else 'N'
        end) as result
from t;
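Row by row, that case expression reduces to a plain OR over the two flags. A minimal plain-Python sketch of the same per-row logic (the helper name is illustrative):

```python
# Row-wise equivalent of the CASE expression:
# 'Y' if either flag is 'Y', otherwise 'N'.
def row_result(flag_1: str, flag_2: str) -> str:
    return 'Y' if 'Y' in (flag_1, flag_2) else 'N'

print(row_result('N', 'Y'))  # Y
print(row_result('N', 'N'))  # N
```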

For the updated version:

select t.*,
       max(case when flag_1 = 'Y' or flag_2 = 'Y'
                then 'Y' else 'N'
           end) over (partition by depname) as result
from t;
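The window max works because 'Y' sorts after 'N', so taking the max of the per-row case value over a depname partition yields 'Y' whenever any row in the partition has a flag set. A plain-Python sketch of that semantics (the sample rows are illustrative, not the asker's full data):

```python
from collections import defaultdict

rows = [
    ('sales', 'N', 'Y'),
    ('sales', 'N', 'N'),
    ('personnel', 'N', 'N'),
    ('develop', 'Y', 'N'),
    ('develop', 'N', 'N'),
]

# Per-partition max of the row-wise 'Y'/'N' value; since 'Y' > 'N'
# lexicographically, the max is 'Y' iff any row in the group qualifies.
partition_max = defaultdict(lambda: 'N')
for dep, f1, f2 in rows:
    value = 'Y' if f1 == 'Y' or f2 == 'Y' else 'N'
    partition_max[dep] = max(partition_max[dep], value)

# Broadcast the partition result back to every row, like the window does.
result = [(dep, partition_max[dep]) for dep, _, _ in rows]
```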

If you are using PySpark (since you included it in the tags) and your dataframe is called df, you can use:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('depName')

df = df\
  .withColumn('cnt', F.sum(F.when((F.col('flag_1') == 'Y') | (F.col('flag_2') == 'Y'), 1).otherwise(0)).over(w))\
  .withColumn('result', F.when(F.col('cnt') >= 1, 'Y').otherwise('N'))

df.show()

+---------+------+------+---+------+
|  depName|flag_1|flag_2|cnt|result|
+---------+------+------+---+------+
|  develop|     Y|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|  develop|     N|     N|  1|     Y|
|personnel|     N|     N|  0|     N|
|personnel|     N|     N|  0|     N|
|    sales|     N|     Y|  1|     Y|
|    sales|     N|     N|  1|     Y|
|    sales|     N|     N|  1|     Y|
+---------+------+------+---+------+

Basically, within each partition determined by depName, you count how many rows satisfy the condition flag_1 == 'Y' | flag_2 == 'Y', and you store that count in cnt for all rows of the partition. Then, a simple .when marks with 'Y' all groups that have cnt >= 1.
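The two-step cnt/result logic above can be sketched in plain Python to show exactly what the window computes (illustrative sample rows, not the full frame):

```python
from collections import Counter

rows = [
    ('develop', 'Y', 'N'), ('develop', 'N', 'N'),
    ('personnel', 'N', 'N'), ('personnel', 'N', 'N'),
    ('sales', 'N', 'Y'), ('sales', 'N', 'N'),
]

# Step 1 (the windowed sum): per-partition count of rows
# where either flag is 'Y'.
cnt = Counter()
for dep, f1, f2 in rows:
    cnt[dep] += 1 if (f1 == 'Y' or f2 == 'Y') else 0

# Step 2 (the .when): every row of a partition with cnt >= 1 gets 'Y'.
results = [(dep, cnt[dep], 'Y' if cnt[dep] >= 1 else 'N')
           for dep, _, _ in rows]
```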
