Pyspark - filter out multiple rows based on a condition in one row
I have a table like so:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 0 | 5 |
| 0 | 4 |
| 0 | 0 |
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
| 3 | -4 |
--------------------------------------------
I would like to remove all IDs which have any Value <= 0, so the result would be:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
--------------------------------------------
I tried doing this by filtering to only the rows with Value <= 0, selecting the distinct IDs from those, converting that to a list, and then removing any rows in the original table that have an ID in that list using df.filter(~df.Id.isin(mylist)).
However, I have a huge amount of data, and I ran out of memory while building the list, so I need to come up with a pure pyspark solution.
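For reference, here is a minimal sketch of that attempted approach (the SparkSession setup and variable names are assumptions for illustration); the collect() that builds the list on the driver is where memory runs out at scale:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed setup

# Sample data from the question (other columns omitted).
df = spark.createDataFrame(
    [(0, 5), (0, 4), (0, 0), (1, 3), (2, 1), (2, 8), (3, -4)],
    ["Id", "Value"],
)

# Collect the distinct offending IDs to the driver as a Python list...
mylist = [row.Id for row in df.filter(df.Value <= 0).select("Id").distinct().collect()]

# ...then filter them out. The collect() above is what exhausts
# driver memory when there are many distinct offending IDs.
df.filter(~df.Id.isin(mylist)).show()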
You can use window functions:
select t.*
from (select t.*, min(value) over (partition by id) as min_value
from t
) t
where min_value > 0
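From pyspark, one way to run that SQL is through a temporary view. This is a sketch, assuming a SparkSession named spark (as in the snippet above) and dropping the helper column to match the desired output:

# Register the DataFrame under the name "t" used in the SQL above.
df.createOrReplaceTempView("t")

spark.sql("""
    select t.*
    from (select t.*, min(value) over (partition by id) as min_value
          from t
         ) t
    where min_value > 0
""").drop("min_value").show()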
As Gordon mentions, you may need a window for this; here is a pyspark version:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Id")

# Flag each row (0 if Value <= 0, else 1), take the min flag per Id,
# and keep only the Ids whose min flag is 1 (no non-positive values).
(df.withColumn("flag", F.when(F.col("Value") <= 0, 0).otherwise(1))
   .withColumn("Min", F.min("flag").over(w))
   .filter(F.col("Min") != 0)
   .drop("flag", "Min")).show()
+---+-----+
| Id|Value|
+---+-----+
| 1| 3|
| 2| 1|
| 2| 8|
+---+-----+
Brief summary of the approach taken:
- Set a flag: when Value <= 0 then 0, else 1.
- Take the Min of this flag over a window partitioned by Id.
- Filter to keep rows only where this Min is not 0.
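As a footnote, the intermediate flag isn't strictly necessary; a shorter variant (an assumption on my part, mirroring Gordon's SQL) takes the window min of Value directly:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Id")

# Keep a row only if the smallest Value within its Id is positive.
(df.withColumn("min_value", F.min("Value").over(w))
   .filter(F.col("min_value") > 0)
   .drop("min_value")).show()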