Given a DataFrame:
import scala.collection.mutable.ArrayBuffer
val df = sc.parallelize(Seq(("foo", ArrayBuffer(null,"bar",null)), ("bar", ArrayBuffer("one","two",null)))).toDF("key", "value")
df.show
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(null,bar,null)|
|bar|ArrayBuffer(one, two,null)|
+---+--------------------------+
I'd like to drop the null elements from the value column. After removal, the DataFrame should look like this:
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(bar) |
|bar|ArrayBuffer(one, two) |
+---+--------------------------+
Any suggestions are welcome. Thanks.
You'll need a UDF here, for example one built on flatMap:
val filterOutNull = udf((xs: Seq[String]) =>
Option(xs).map(_.flatMap(Option(_))))
df.withColumn("value", filterOutNull($"value"))
where the outer Option with map handles NULL columns:
Option(null: Seq[String]).map(identity)
Option[Seq[String]] = None
Option(Seq("foo", null, "bar")).map(identity)
Option[Seq[String]] = Some(List(foo, null, bar))
and ensures we don't fail with an NPE when the input is NULL / null, by mapping
NULL -> null -> None -> None -> NULL
where null is a Scala null and NULL is a SQL NULL.
The inner flatMap flattens a sequence of Options, effectively filtering out nulls:
Seq("foo", null, "bar").flatMap(Option(_))
Seq[String] = List(foo, bar)
A more imperative equivalent could be something like this:
val imperativeFilterOutNull = udf((xs: Seq[String]) =>
if (xs == null) xs
else for {
x <- xs
if x != null
} yield x)
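To compare the two UDF bodies above without starting Spark, here is a minimal plain-Scala sketch. The names functional, imperative and samples are hypothetical and exist only for this illustration; the function bodies are copied from the UDFs above.

```scala
// Plain-Scala copies of the two UDF bodies, so they can be tried in any Scala REPL.
// The names `functional`, `imperative` and `samples` are illustrative only.
def functional(xs: Seq[String]): Option[Seq[String]] =
  Option(xs).map(_.flatMap(Option(_)))

def imperative(xs: Seq[String]): Seq[String] =
  if (xs == null) xs
  else for { x <- xs if x != null } yield x

// Inputs: a null sequence, an empty one, and the two rows from the question.
val samples: Seq[Seq[String]] =
  Seq(null, Seq.empty, Seq(null, "bar", null), Seq("one", "two", null))

samples.foreach { xs =>
  // orNull turns None back into null so both results can be printed side by side
  println(s"${functional(xs).orNull} | ${imperative(xs)}")
}
```

Both versions should agree on every input; the only difference is that the first signals a missing array with None while the second passes the null through.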
Option 1: using a UDF (this version assumes the array column itself is never NULL, otherwise arr.filter throws an NPE):
val filterNull = udf((arr : Seq[String]) => arr.filter((x: String) => x != null))
df.withColumn("value", filterNull($"value")).show()
Option 2: no UDF
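df.withColumn("value", explode($"value")).filter($"value".isNotNull).groupBy("key").agg(collect_list($"value").as("value")).show()
Note that this is less efficient: explode followed by groupBy forces a shuffle. It also drops rows whose array contains only nulls, and collect_list does not guarantee the original element order.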
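Another no-UDF possibility, assuming Spark 2.4 or later, is the SQL higher-order function filter, called through expr. The sketch below is not taken from the answers above; it builds its own local SparkSession and rebuilds the question's data with plain Seq values so that it is self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: Spark 2.4+, where the SQL higher-order function `filter` exists.
// In spark-shell a session named `spark` already exists and this line can be skipped.
val spark = SparkSession.builder().master("local[1]").appName("drop-nulls-sketch").getOrCreate()
import spark.implicits._

// Same data as in the question, built with plain Seq values.
val df = Seq(
  ("foo", Seq[String](null, "bar", null)),
  ("bar", Seq("one", "two", null))
).toDF("key", "value")

// Keep only the non-null elements of each array; no explode or groupBy is involved.
val cleaned = df.withColumn("value", expr("filter(value, x -> x is not null)"))
cleaned.show(false)
```

Because the filtering happens per row inside the array, this avoids the shuffle of the explode/groupBy approach and keeps the element order.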
You can also use spark-daria, which provides com.github.mrpowers.spark.daria.sql.functions.arrayExNull. From the documentation:
Like array but doesn't include null element