
How to filter out rows with lots of conditions in pyspark?

Let's say that these are my data:

Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
1              | Old       | Chainsaw   | USA
1              | Null      | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy
5              | Old       | Fertilizer | Italy

The problem is that sometimes a Product_Number appears more than once, while it should be unique. What I am trying to do is: for the Product_Numbers that appear in the dataframe more than once, keep only the rows whose Condition is New, without touching the rest. That gives this result:

Product_Number | Condition | Type       | Country
1              | New       | Chainsaw   | USA
2              | Old       | Tractor    | India
3              | Null      | Machete    | Colombia
4              | New       | Shovel     | Brazil
5              | New       | Fertilizer | Italy

What I tried first was to see how many distinct product numbers I have:

df.select('Product_Number').distinct().count()
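
With the sample data above, this returns 5, since the distinct product numbers are 1 through 5.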

Then I identified the product numbers that appear more than once and put them in a list:

numbers = df.select('Product_Number').groupBy('Product_Number').count().where('count > 1')\
                   .select('Product_Number').rdd.flatMap(lambda x: x).collect()
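
For the sample data above, numbers would be [1, 5], since those are the only product numbers with more than one row.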

Then I am trying to filter out the rows whose Product_Number exists more than once and whose Condition isn't New. If the filtering is done correctly, the row count of the result should equal df.select('Product_Number').distinct().count().

The code that I have tried is:

1. df.filter(~(df.Product_Number.isin(numbers)) & ~((df.Condition == 'Old') | (df.Condition.isNull())))

2. df.filter(~((df.Product_Number.isin(numbers)) & ((df.Condition == 'Old') | (df.Condition.isNull()))))

3. df.filter(~(df.Product_Number.isin(numbers)) & (df.Condition == 'New'))

However, I haven't succeeded so far.

Your conditions should be:

(Product_Number is in numbers AND Condition == New) OR 
(Product_Number is not in numbers)
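
Note that attempts 1 and 3 start with ~df.Product_Number.isin(numbers), which filters out every duplicated Product_Number entirely, New rows included, so products 1 and 5 can never survive them.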

So, this is the correct filter condition:

df.filter((df.Product_Number.isin(numbers) & (df.Condition == 'New')) 
| (~df.Product_Number.isin(numbers)))
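
As a sanity check (the variable name filtered is just for illustration), the row count of the filtered result should match the distinct count from earlier:

filtered = df.filter((df.Product_Number.isin(numbers) & (df.Condition == 'New'))
                     | (~df.Product_Number.isin(numbers)))
filtered.count()  # should equal df.select('Product_Number').distinct().count()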

However, collect can be a heavy operation if you have a large dataset, and you can rewrite your code without collect:

from pyspark.sql import functions as F
from pyspark.sql import Window

# count rows per Product_Number; keep New rows from duplicated products and all unique rows
w = Window.partitionBy('Product_Number')
df = (df.withColumn('cnt', F.count('*').over(w))
      .filter(((F.col('Condition') == 'New') & (F.col('cnt') > 1)) | (F.col('cnt') == 1))
)
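
For reference, here is a minimal end-to-end sketch of the window approach, assuming an existing SparkSession named spark and that the Null values in the sample are actual nulls:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# the sample data from the question, with Null represented as None
data = [(1, 'New', 'Chainsaw', 'USA'), (1, 'Old', 'Chainsaw', 'USA'),
        (1, None, 'Chainsaw', 'USA'), (2, 'Old', 'Tractor', 'India'),
        (3, None, 'Machete', 'Colombia'), (4, 'New', 'Shovel', 'Brazil'),
        (5, 'New', 'Fertilizer', 'Italy'), (5, 'Old', 'Fertilizer', 'Italy')]
df = spark.createDataFrame(data, ['Product_Number', 'Condition', 'Type', 'Country'])

w = Window.partitionBy('Product_Number')
result = (df.withColumn('cnt', F.count('*').over(w))
            .filter(((F.col('Condition') == 'New') & (F.col('cnt') > 1)) | (F.col('cnt') == 1))
            .drop('cnt'))
result.show()  # one row per Product_Number; New kept for the duplicated ones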
