
Drop duplicates except null in spark

I see that pandas has a way to drop duplicates while ignoring nulls (see "Drop duplicates, but ignore nulls"). Is there a way in Spark to drop duplicates while ignoring null values, that is, without dropping the rows where the value is null?

For example, I want to drop rows with a duplicate "animal":

val columns=Array("id", "color", "animal")
val df1=sc.parallelize(Seq(
  (1, "Blue", null),      // don't drop this
  (4, "yellow", null),    // don't drop this
  (2, "Red", "Fish"),
  (5, "green", "panda"),  // one of the two panda rows should be dropped
  (6, "red", "panda"),    // one of the two panda rows should be dropped
  (7, "Blue", "koala")
)).toDF(columns: _*)


df1.show()

val dropped = df1.dropDuplicates("animal") 

dropped.show()

I see that dropDuplicates accepts additional columns. I tried that approach, but it introduces another problem: duplicate animals that are not null are no longer dropped.

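Use a window function:

The following approach numbers the rows within each "animal" partition and keeps the first row of every non-null group plus all rows where "animal" is null. It may perform better than distinct/dropDuplicates, although that depends on the data and is worth measuring.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df1.withColumn("rn", row_number().over(Window.partitionBy("animal").orderBy("animal")))
  .where(('rn === 1 && 'animal.isNotNull) || ('rn >= 1 && 'animal.isNull))
  .show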

+---+------+------+---+
| id| color|animal| rn|
+---+------+------+---+
|  5| green| panda|  1|
|  7|  Blue| koala|  1|
|  1|  Blue|  null|  1|
|  4|yellow|  null|  2|
|  2|   Red|  Fish|  1|
+---+------+------+---+
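
For reference, here is a self-contained sketch of the same window idea. It makes a few assumptions that are not in the answer above: a local SparkSession is created for illustration (in spark-shell, `spark` already exists), the window is ordered by `id` so that the surviving row is deterministic (lowest `id` wins), and the helper column `rn` is dropped at the end.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("dedup-keep-nulls").getOrCreate()
import spark.implicits._

// Same sample data as in the question; the element type is spelled out
// so that the null values are typed as String.
val df = Seq[(Int, String, String)](
  (1, "Blue", null),
  (4, "yellow", null),
  (2, "Red", "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda"),
  (7, "Blue", "koala")
).toDF("id", "color", "animal")

// Number the rows inside each "animal" group, lowest id first (an assumed tie-break).
val byAnimal = Window.partitionBy("animal").orderBy("id")

val deduped = df
  .withColumn("rn", row_number().over(byAnimal))
  // Keep every null row, and only the first row of each non-null group.
  .where(col("animal").isNull || col("rn") === 1)
  .drop("rn")

deduped.orderBy("id").show()
```

With this ordering, the rows with ids 1, 2, 4, 5 and 7 remain: both null rows are kept and the panda row with id 6 is dropped.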

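Another approach is as follows (complete code shown):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema2 = StructType(List(StructField("id", IntegerType, true), StructField("color", StringType, true), StructField("animal", StringType, true)))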
val data = sc.parallelize(Seq(
        (1, "Blue", null ), // dont drop this
        (4, "yellow", null ), // dont drop this
        (2, "Red", "Fish"),
        (5, "green", "panda"), // one panda row needs to drop
        (6, "red", "panda"), // one panda needs to drop
        (7, "Blue", "koala")
      )).map(t => Row(t._1,t._2,t._3))
val df2 = spark.createDataFrame(data, schema2)

df2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
|  1|  Blue|  null|
|  4|yellow|  null|
|  2|   Red|  Fish|
|  5| green| panda|
|  6|   red| panda|
|  7|  Blue| koala|
+---+------+------+
*/
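// Drop duplicates except nulls: keep every row where "animal" is null,
// then append the de-duplicated rows where "animal" is not null.
// na.drop is restricted to "animal" so that a null in another column
// does not remove the row.
val dropped2 = df2
    .filter(df2("animal").isNull)
    .union(df2.na.drop(Seq("animal")).dropDuplicates("animal"))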

dropped2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
|  1|  Blue|  null|
|  4|yellow|  null|
|  2|   Red|  Fish|
|  7|  Blue| koala|
|  5| green| panda|
+---+------+------+
*/
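
The split-and-union idea above can be wrapped in a small reusable function. This is only a sketch: `dropDuplicatesKeepNulls` is a hypothetical helper name, not part of the Spark API, and the local SparkSession is created just for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("dedup-union").getOrCreate()
import spark.implicits._

// Hypothetical helper (not part of Spark): de-duplicate on `column`
// while keeping every row whose value in `column` is null.
def dropDuplicatesKeepNulls(df: DataFrame, column: String): DataFrame = {
  val nullRows = df.filter(col(column).isNull)
  val distinctRows = df.filter(col(column).isNotNull).dropDuplicates(column)
  nullRows.union(distinctRows)
}

// Same sample data as in the question.
val sample = Seq[(Int, String, String)](
  (1, "Blue", null),
  (4, "yellow", null),
  (2, "Red", "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda"),
  (7, "Blue", "koala")
).toDF("id", "color", "animal")

val result = dropDuplicatesKeepNulls(sample, "animal")
result.show()
```

Note that dropDuplicates does not promise which of the two panda rows survives; if that matters, a window with an explicit ordering is the safer choice.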


 