I have a table like so:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 0 | 5 |
| 0 | 4 |
| 0 | 0 |
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
| 3 | -4 |
--------------------------------------------
I would like to remove all IDs which have any Value <= 0, so the result would be:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
--------------------------------------------
I tried doing this by filtering to only rows with Value<=0, selecting the distinct IDs from this, converting that to a list, and then removing any rows in the original table that have an ID in that list using df.filter(~df.Id.isin(mylist))
However, I have a huge amount of data, and this ran out of memory making the list, so I need to come up with a pure pyspark solution.
You can use window functions:
select t.*
from (select t.*, min(value) over (partition by id) as min_value
from t
) t
where min_value > 0
As Gordon mentions, you may need a window for this, here is a pyspark version:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("Id")
(df.withColumn("flag",F.when(F.col("Value")<=0,0).otherwise(1))
.withColumn("Min",F.min("flag").over(w)).filter(F.col("Min")!=0)
.drop("flag","Min")).show()
+---+-----+
| Id|Value|
+---+-----+
| 1| 3|
| 2| 1|
| 2| 8|
+---+-----+
Brief summary of approach taken:
Value<=0
then 0
else `1Min
value is not 0`
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.