How to filter nullable array elements in a Spark 1.6 UDF

Question

Consider the following DataFrame:

root
 |-- values: array (nullable = true)
 |    |-- element: double (containsNull = true)

with content:

+-----------+
|     values|
+-----------+
|[1.0, null]|
+-----------+

Now I want to pass the values column to a UDF:

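import org.apache.spark.sql.functions.udf

val inspect = udf((data: Seq[Double]) => {
  data.foreach(println)                 // first block of output
  println()
  data.foreach(d => println(d))         // second block of output
  println()
  data.foreach(d => println(d == null)) // third block of output
  ""                                    // dummy return value
})

df.withColumn("dummy", inspect($"values"))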

I'm really confused by the output of the above println statements:

1.0
null

1.0
0.0

false
false

My questions:

Why is foreach(println) not giving the same output as foreach(d => println(d))?
How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
How can I filter null values in my Seq other than by filtering out 0.0, which isn't really safe? Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls? (This works, but I'm unsure if that is the way to go.)

Note that I'm aware of this question, but my question is specific to array-type columns.

Answer 1

Why is foreach(println) not giving the same output as foreach(d => println(d))?

In a context where Any is expected, the cast is skipped completely. This is explained in detail in "If an Int can't be null, what does null.asInstanceOf[Int] mean?".

How can the Double be null in the first println statement? I thought Scala's Double cannot be null.

The internal binary representation doesn't use Scala types at all. Once array data is decoded, it is represented as an Array[Any], and elements are coerced to the declared type with a simple asInstanceOf.
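The behavior can be reproduced without Spark. The snippet below is only a sketch: it assumes that a Seq[Any] cast to Seq[Double] is a fair stand-in for what the UDF receives, and the names data, nullFlags and unboxed are made up for illustration.

```scala
// Stand-in for the UDF argument: boxed elements hidden behind a Seq[Double] type.
// The cast is unchecked, so the null survives inside the collection.
val data: Seq[Double] = Seq[Any](1.0, null).asInstanceOf[Seq[Double]]

// A function over Any (like println) receives each element still boxed,
// so the null is visible.
val nullFlags: Seq[Boolean] = data.map((x: Any) => x == null)

// A function over Double (like d => println(d)) unboxes each element first,
// and unboxing null yields the default value 0.0.
val unboxed: Seq[Double] = data.map((d: Double) => d)

println(nullFlags) // List(false, true)
println(unboxed)   // List(1.0, 0.0)
```

This matches the output in the question: the null shows up only when the element is handled as Any, and d == null is always false once the element has been unboxed to a Double.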

Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls?

In general, if values are nullable, you should use an external type that is nullable as well, or Option. Unfortunately, only the first option is applicable to UDFs.
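A minimal sketch of the first option, along the lines of what the question already reports as working. The names dropNulls, dropNullsUdf and the "cleaned" column are hypothetical, and the null check on the whole array is an added assumption based on the column being declared nullable.

```scala
import org.apache.spark.sql.functions.udf

// Declare the elements as java.lang.Double so a null stays a real null
// instead of being unboxed to 0.0.
def dropNulls(data: Seq[java.lang.Double]): Seq[Double] =
  if (data == null) null                          // the array column itself is nullable
  else data.filter(_ != null).map(_.doubleValue)  // keep only real values, then unbox

// Wrap the plain function as a UDF (hypothetical name).
val dropNullsUdf = udf(dropNulls _)
```

It could then be applied as df.withColumn("cleaned", dropNullsUdf($"values")). Keeping the filtering logic in a plain function makes it easy to check outside of Spark, and a legitimate 0.0 in the input is preserved.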

Answer 1 (accepted), 2017-02-26 14:28:16
