
Eliminate null value rows for a specific column while doing partitionBy column in pyspark

I have a PySpark DataFrame like this:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+

For a given id, I would like to keep the record even though the name column is null, as long as that id is not repeated. If the id is repeated, I would like to check the name column and make sure it does not contain duplicates within that id, and also remove the rows where name is null, but only for those repeated ids. Below is the desired output:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+

How can I achieve this in PySpark?
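
For reference, here is a minimal sketch to reproduce the sample DataFrame (the local SparkSession setup and the column types are assumptions, not part of the original question):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession just for reproducing the example
spark = SparkSession.builder.master("local[*]").appName("null-name-filter").getOrCreate()

data = [
    (111, None, "CT"),
    (222, "name1", "CT"),
    (222, "name2", "CT"),
    (333, "name3", "CT"),
    (333, "name4", "CT"),
    (333, None, "CT"),
]
df = spark.createDataFrame(data, ["id", "name", "state"])
df.show()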

You can do this by grouping by the id column and counting the number of names in each group. count ignores null values in Spark, so any group whose count is 0 (only null names) should be kept as-is. We can then filter away the null names in groups whose count is larger than 0.

In Scala this can be done with a window function as follows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._  // for the $"colName" syntax

val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))  // count ignores null names
  .filter($"name".isNotNull or $"gCount" === 0)            // keep non-null names, or ids with no names at all
  .drop("gCount")

The PySpark equivalent:

from pyspark.sql.functions import col, count
from pyspark.sql.window import Window

w = Window.partitionBy("id")
df2 = (df.withColumn("gCount", count("name").over(w))
       .filter((col("name").isNotNull()) | (col("gCount") == 0))
       .drop("gCount"))

The above will not remove rows where an id has several null names and no non-null name at all (all of those rows will be kept).

If these should be removed as well, keeping only a single row with name == null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this will also remove any other duplicates (in which case .dropDuplicates(['id','name','state']) could be preferable).
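
For example, a minimal sketch of that post-treatment, assuming the df2 variable from the snippet above:

deduped = df2.dropDuplicates(["id", "name"])  # also collapses repeated null names within the same id
deduped.show()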

I think you can do that in two steps. First, count values by id:

import pyspark.sql.functions as psf
import pyspark.sql.window as psw

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))  # total rows per id, nulls included

Then filter to remove the null names when n > 1:

df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))  # ~ is the logical NOT for Column expressions

Edit

As mentioned by @Shubham Jain, if you have several null values for name (duplicates) within the same id, the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post-treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.
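
Putting the two steps and the post-treatment together, a minimal end-to-end sketch (the df variable from the question and the result name are assumptions):

import pyspark.sql.functions as psf
import pyspark.sql.window as psw

w = psw.Window.partitionBy("id")
result = (
    df.withColumn("n", psf.sum(psf.lit(1)).over(w))               # total rows per id, nulls included
      .filter(~(psf.col("name").isNull() & (psf.col("n") > 1)))   # drop null names only for repeated ids
      .drop("n")
      .dropDuplicates(["id", "name"])                             # remove any remaining duplicate names per id
)
result.show()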
