I see that pandas has a way to drop duplicates while ignoring nulls (see "Drop duplicates, but ignore nulls"). Is there a way to drop duplicates in Spark while ignoring null values, that is, without dropping the rows where the value is null?
For example, I want to drop rows with a duplicate "animal" value:
val columns = Array("id", "color", "animal")
val df1 = sc.parallelize(Seq(
  (1, "Blue", null),     // don't drop this
  (4, "yellow", null),   // don't drop this
  (2, "Red", "Fish"),
  (5, "green", "panda"), // one of the panda rows should be dropped
  (6, "red", "panda"),   // one of the panda rows should be dropped
  (7, "Blue", "koala")
)).toDF(columns: _*)
df1.show()
val dropped = df1.dropDuplicates("animal")
dropped.show()
I see that dropDuplicates accepts additional columns. I tried that approach, but it introduces another problem: duplicate animals that are not null are no longer dropped.
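To make both behaviours concrete, here is a minimal sketch that counts the surviving rows in each case. It assumes a local SparkSession; the object name `DropDuplicatesNullBehaviour` and the helper `sample` are made up for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names, for illustration only.
object DropDuplicatesNullBehaviour {

  // Rebuilds the sample data from the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, String, String)](
      (1, "Blue", null),
      (4, "yellow", null),
      (2, "Red", "Fish"),
      (5, "green", "panda"),
      (6, "red", "panda"),
      (7, "Blue", "koala")
    ).toDF("id", "color", "animal")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dropDuplicates-null-behaviour")
      .getOrCreate()

    val df = sample(spark)

    // Nulls are treated as equal to each other here, so the two
    // null rows collapse into one: null, Fish, panda, koala remain.
    println(df.dropDuplicates("animal").count()) // 4

    // Adding "id" makes every row unique, so both pandas survive.
    println(df.dropDuplicates("animal", "id").count()) // 6

    spark.stop()
  }
}
```

Neither call gives the desired five rows, which is why the null rows have to be handled separately from the deduplication.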
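Use a window function:
The following approach may perform better than the distinct/dropDuplicates method. Number the rows within each "animal" partition, then keep the first row of every non-null animal and all rows where the animal is null.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
df1.withColumn("rn", row_number().over(Window.partitionBy("animal").orderBy("animal")))
  .where(('rn === 1 && 'animal.isNotNull) || ('rn >= 1 && 'animal.isNull))
  .show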
+---+------+------+---+
| id| color|animal| rn|
+---+------+------+---+
| 5| green| panda| 1|
| 7| Blue| koala| 1|
| 1| Blue| null| 1|
| 4|yellow| null| 2|
| 2| Red| Fish| 1|
+---+------+------+---+
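The window above orders each partition by the partitioning column itself, so which of the duplicate rows receives rn = 1 is not guaranteed. The following is a minimal sketch of the same idea made deterministic, assuming that keeping the row with the smallest id is acceptable (that tie-break is an assumption, not something the answer specifies). The object name `WindowDedupSketch` and the helper `dedupKeepNulls` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical names, for illustration only.
object WindowDedupSketch {

  // Keeps every row where `column` is null, and for each non-null value
  // keeps only the row that sorts first by `orderColumn`.
  def dedupKeepNulls(df: DataFrame, column: String, orderColumn: String): DataFrame = {
    val w = Window.partitionBy(column).orderBy(orderColumn)
    df.withColumn("rn", row_number().over(w))
      .where(col(column).isNull || col("rn") === 1)
      .drop("rn") // remove the helper column from the result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-dedup-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, String, String)](
      (1, "Blue", null),
      (4, "yellow", null),
      (2, "Red", "Fish"),
      (5, "green", "panda"),
      (6, "red", "panda"),
      (7, "Blue", "koala")
    ).toDF("id", "color", "animal")

    // Rows with id 1, 2, 4, 5 and 7 remain; the panda with id 6 is dropped.
    dedupKeepNulls(df, "animal", "id").orderBy("id").show()

    spark.stop()
  }
}
```

Note that the helper assumes the input has no existing column named "rn"; a different temporary name would be needed otherwise.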
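One approach would be as follows (I show the complete code):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema2 = StructType(List(
  StructField("id", IntegerType, true),
  StructField("color", StringType, true),
  StructField("animal", StringType, true)))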
val data = sc.parallelize(Seq(
  (1, "Blue", null),     // don't drop this
  (4, "yellow", null),   // don't drop this
  (2, "Red", "Fish"),
  (5, "green", "panda"), // one of the panda rows should be dropped
  (6, "red", "panda"),   // one of the panda rows should be dropped
(7, "Blue", "koala")
)).map(t => Row(t._1,t._2,t._3))
val df2 = spark.createDataFrame(data, schema2)
df2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
| 1| Blue| null|
| 4|yellow| null|
| 2| Red| Fish|
| 5| green| panda|
| 6| red| panda|
| 7| Blue| koala|
+---+------+------+
*/
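// drop duplicates on "animal", but keep every row where "animal" is null
// (na.drop is restricted to "animal" so that nulls in other columns do not remove rows)
val dropped2 = df2
  .filter(df2("animal").isNull)
  .union(df2.na.drop("any", Seq("animal")).dropDuplicates("animal"))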
dropped2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
| 1| Blue| null|
| 4|yellow| null|
| 2| Red| Fish|
| 7| Blue| koala|
| 5| green| panda|
+---+------+------+
*/
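The filter-and-union idea above can be wrapped into a small reusable function. This is only a sketch under stated assumptions: the object name `UnionDedupSketch` and the function `dropDuplicatesKeepNulls` are hypothetical, and `unionByName` is used instead of `union` so that the two halves are matched by column name rather than by position.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical names, for illustration only.
object UnionDedupSketch {

  // Splits the frame on whether `column` is null, deduplicates only the
  // non-null half, and glues the two halves back together.
  def dropDuplicatesKeepNulls(df: DataFrame, column: String): DataFrame = {
    val nullRows = df.filter(col(column).isNull)
    val deduped = df.filter(col(column).isNotNull).dropDuplicates(Seq(column))
    nullRows.unionByName(deduped)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-dedup-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Int, String, String)](
      (1, "Blue", null),
      (4, "yellow", null),
      (2, "Red", "Fish"),
      (5, "green", "panda"),
      (6, "red", "panda"),
      (7, "Blue", "koala")
    ).toDF("id", "color", "animal")

    // Five rows remain: both null rows, Fish, koala and one panda.
    dropDuplicatesKeepNulls(df, "animal").show()

    spark.stop()
  }
}
```

Which of the two panda rows survives is not guaranteed with dropDuplicates; if that matters, a window with an explicit ordering is the safer choice.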