简体   繁体   中英

How to filter nullable Array-Elements in Spark 1.6 UDF

Consider the following DataFrame

root
 |-- values: array (nullable = true)
 |    |-- element: double (containsNull = true)

with content:

+-----------+
|     values|
+-----------+
|[1.0, null]|
+-----------+

Now I want to pass thie value column to an UDF:

val inspect = udf((data:Seq[Double]) => {
  data.foreach(println)
  println()
  data.foreach(d => println(d))
  println()
  data.foreach(d => println(d==null))
  ""
})

df.withColumn("dummy",inspect($"values"))

I'm really confused from the output of the above println statements:

1.0
null

1.0
0.0

false
false

My questions:

  1. Why is foreach(println) not giving the same output as foreach(d=>println(d)) ?
  2. How can the Double be null in the first println-statement, I thought scala's Double cannot be null?
  3. How can I filter null values in my Seq other han filtering 0.0 which isnt really safe? Should I use Seq[java.lang.Double] as type for my input in the UDF and then filter nulls? (this works, but I'm unsure if that is the way to go)

Note that I'm aware of this Question , but my question is specific to array-type columns.

Why is foreach(println) not giving the same output as foreach(d=>println(d))?

In the context where Any is expected data cast is skipped completely. This is explained in detail in If an Int can't be null, what does null.asInstanceOf[Int] mean?

How can the Double be null in the first println-statement, I thought scala's Double cannot be null?

Internal binary representation doesn't use Scala types at all. Once array data is decoded it is represented as an Array[Any] and elements are coerced to a declared type with simple asInstanceOf .

Should I use Seq[java.lang.Double] as type for my input in the UDF and then filter nulls?

In general if values are nullable then you should use external type which is nullable as well or Option . Unfortunately only the first option is applicable for UDFs.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM