How to filter out rows with lots of conditions in pyspark?
Let's say that these are my data:
Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
1              | Old       | Chainsaw   | USA
1              | Null      | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy
5              | Old       | Fertilizer | Italy
The problem is that sometimes there is more than one row per Product_Number, although it should be unique. For the product numbers that appear in the dataframe more than once, I want to keep only the rows whose Condition is New, without touching the rest. That should give this result:
Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy
What I tried first is to see how many distinct product numbers I have:
df.select('Product_Number').distinct().count()
Then I identify the product numbers that appear more than once and put them in a list:
numbers = df.select('Product_Number').groupBy('Product_Number').count().where('count > 1')\
.select('Product_Number').rdd.flatMap(lambda x: x).collect()
Then I am trying to filter out the rows whose product number appears more than once and whose Condition isn't New. If the filtering is done correctly, counting the distinct product numbers of the result should give the same number as df.select('Product_Number').distinct().count().
The code that I have tried is:
1) df.filter(~(df.Product_Number.isin(numbers)) & ~((df.Condition == 'Old') | (df.Condition.isNull())))
2) df.filter(~((df.Product_Number.isin(numbers)) & ((df.Condition == 'Old') | (df.Condition.isNull()))))
3) df.filter(~(df.Product_Number.isin(numbers)) & (df.Condition == 'New'))
However, none of these has worked so far.
Your conditions should be:
(Product_Number is in numbers AND Condition == New) OR
(Product_Number is not in numbers)
So, this is the correct filter condition:
df.filter((df.Product_Number.isin(numbers) & (df.Condition == 'New'))
| (~df.Product_Number.isin(numbers)))
However, collect can be a heavy operation if you have a large dataset, and you can rewrite your code without collect:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Product_Number')
df = (df.withColumn('cnt', F.count('*').over(w))
.filter(((F.col('Condition') == 'New') & (F.col('cnt') > 1)) | (F.col('cnt') == 1))
)