
Spark DataFrame - drop null values from column

Given a DataFrame:

    import scala.collection.mutable.ArrayBuffer

    val df = sc.parallelize(Seq(("foo", ArrayBuffer(null,"bar",null)), ("bar", ArrayBuffer("one","two",null)))).toDF("key", "value")
    df.show

    +---+--------------------------+
    |key|                     value|
    +---+--------------------------+
    |foo|ArrayBuffer(null,bar,null)|
    |bar|ArrayBuffer(one, two,null)|
    +---+--------------------------+

I'd like to drop the nulls from the value column. After removal, the DataFrame should look like this:

    +---+--------------------------+
    |key|                     value|
    +---+--------------------------+
    |foo|ArrayBuffer(bar)          |
    |bar|ArrayBuffer(one, two)     |
    +---+--------------------------+

Any suggestions are welcome. Thanks.

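You'll need a UDF here, for example one built on flatMap:

    import org.apache.spark.sql.functions.udf

    // Returns None for a NULL array, otherwise the array without its null elements.
    val filterOutNull = udf((xs: Seq[String]) =>
      Option(xs).map(_.flatMap(Option(_))))

    df.withColumn("value", filterOutNull($"value"))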

where the outer Option with map handles NULL columns:

    Option(null: Seq[String]).map(identity)
    // Option[Seq[String]] = None
    Option(Seq("foo", null, "bar")).map(identity)
    // Option[Seq[String]] = Some(List(foo, null, bar))

and ensures that we don't fail with an NPE when the input is NULL / null, by mapping

    NULL -> null -> None -> None -> NULL

where null is a Scala null and NULL is a SQL NULL.

The inner flatMap flattens a sequence of Options, effectively filtering out the nulls:

    Seq("foo", null, "bar").flatMap(Option(_))
    // Seq[String] = List(foo, bar)
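
To see the two layers working together outside Spark, here is a minimal sketch that applies the same body as the UDF above to plain Scala values. The object and method names (NullFiltering, dropNulls) are made up for illustration and appear in neither the answer nor Spark.

```scala
object NullFiltering {
  // Same expression as the UDF body: the outer Option guards against a null
  // sequence, the inner flatMap drops the null elements.
  def dropNulls(xs: Seq[String]): Option[Seq[String]] =
    Option(xs).map(_.flatMap(Option(_)))

  def main(args: Array[String]): Unit = {
    println(dropNulls(Seq("foo", null, "bar"))) // Some(List(foo, bar))
    println(dropNulls(Seq(null, null)))         // Some(List())
    println(dropNulls(null))                    // None
  }
}
```

Note that an array containing only nulls comes back as an empty array, not as None. Only a null array maps to None.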

A more imperative equivalent could be something like this:

    val imperativeFilterOutNull = udf((xs: Seq[String]) =>
      if (xs == null) xs
      else for {
        x <- xs
        if x != null
      } yield x)

Option 1: using a UDF:

    val filterNull = udf((arr: Seq[String]) => arr.filter((x: String) => x != null))
    df.withColumn("value", filterNull($"value")).show()
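
The Option 1 UDF calls filter directly on its argument, so a row whose value is itself NULL would presumably reach it as a Scala null, which is the NPE case described earlier. The sketch below shows the difference with plain functions outside Spark. The names unguarded and guarded are hypothetical, and the guarded variant is just one possible fix, not part of the original answer.

```scala
import scala.util.Try

object NullInputCheck {
  // Same body as the Option 1 UDF, written as a plain function.
  val unguarded: Seq[String] => Seq[String] =
    arr => arr.filter(_ != null)

  // Hypothetical guarded variant: a null array is passed through unchanged.
  val guarded: Seq[String] => Seq[String] =
    arr => if (arr == null) null else arr.filter(_ != null)

  def main(args: Array[String]): Unit = {
    println(Try(unguarded(null)).isFailure)   // true (NullPointerException)
    println(guarded(null) == null)            // true
    println(guarded(Seq("one", "two", null))) // List(one, two)
  }
}
```

With the sample data from the question no array is NULL, so both variants would behave the same there.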

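Option 2: without a UDF:

    df.withColumn("value", explode($"value"))
      .filter($"value".isNotNull)
      .groupBy("key")
      .agg(collect_list($"value").as("value"))
      .show()

Note that this is less efficient, since the rows have to be exploded and then regrouped by key, which involves a shuffle.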
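
Beyond cost, the explode / filter / groupBy route can also change which rows survive: explode emits nothing for a NULL or empty array, and a key whose array holds only nulls has no rows left to group. The sketch below is a rough model of that pipeline on plain Scala collections. It is not Spark code, and the names ExplodeModel and explodeFilterCollect and the extra "baz" row are invented for illustration.

```scala
object ExplodeModel {
  // Models explode -> filter(isNotNull) -> groupBy(key) -> collect_list
  // on in-memory (key, array) pairs.
  def explodeFilterCollect(rows: Seq[(String, Seq[String])]): Map[String, Seq[String]] =
    rows
      // "explode": one (key, element) pair per element; a null or empty array yields none
      .flatMap { case (key, values) => Option(values).getOrElse(Nil).map(v => (key, v)) }
      // keep only non-null elements
      .filter { case (_, v) => v != null }
      // regroup by key and collect the remaining elements
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("foo", Seq(null, "bar", null)),
      ("bar", Seq("one", "two", null)),
      ("baz", Seq[String](null)) // hypothetical row containing only nulls
    )
    val result = explodeFilterCollect(rows)
    println(result.get("foo"))      // Some(List(bar))
    println(result.get("bar"))      // Some(List(one, two))
    println(result.contains("baz")) // false
  }
}
```

In this model the "baz" key disappears entirely, whereas the UDF approaches would keep the row with an empty array. Grouping also merges rows that share a key, which matters only if key is not unique.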

You can also use spark-daria, which provides com.github.mrpowers.spark.daria.sql.functions.arrayExNull.

From the documentation:

    Like array but doesn't include null element
