How to filter nullable Array-Elements in Spark 1.6 UDF
Consider the following DataFrame:
root
|-- values: array (nullable = true)
| |-- element: double (containsNull = true)
with content:
+-----------+
|     values|
+-----------+
|[1.0, null]|
+-----------+
Now I want to pass the values column to a UDF:
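import org.apache.spark.sql.functions.udf

val inspect = udf((data: Seq[Double]) => {
  data.foreach(println)                  // println passed as a method value
  println()
  data.foreach(d => println(d))          // explicit lambda, d is typed as Double
  println()
  data.foreach(d => println(d == null))
  ""                                     // dummy return value
})

df.withColumn("dummy", inspect($"values"))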
I'm really confused by the output of the above println statements:
1.0
null

1.0
0.0

false
false
My questions:
1. Why is foreach(println) not giving the same output as foreach(d=>println(d))?
2. How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
3. How can I filter null values in my Seq other than by filtering 0.0, which isn't really safe? Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls? (This works, but I'm unsure whether that is the way to go.)
Note that I'm aware of this Question, but my question is specific to array-type columns.
Why is foreach(println) not giving the same output as foreach(d=>println(d))?
In a context where Any is expected, the cast of the data is skipped completely. This is explained in detail in "If an Int can't be null, what does null.asInstanceOf[Int] mean?".
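A minimal sketch of that difference in plain Scala, without Spark. The names ForeachDemo, show, and asAny are made up for illustration, and the unchecked cast only imitates a sequence whose boxed elements include a null; it is not Spark's actual decoding code.

```scala
object ForeachDemo {
  // Imitates the UDF input: boxed elements, one of them null, behind a
  // Seq[Double] static type. The cast is unchecked because of type erasure.
  val data: Seq[Double] = Seq[Any](1.0, null).asInstanceOf[Seq[Double]]

  // Takes Any, like println, so it never needs a primitive double.
  def show(x: Any): String = s"$x"

  // A function typed Any => String: each element is handed over as-is.
  val asAny: Any => String = show

  // No unboxing happens here, so the null is still visible.
  val passedAsAny: Seq[String] = data.map(asAny)

  // Here d is typed as Double, so each element is unboxed first,
  // and unboxing null yields the default value 0.0.
  val passedAsDouble: Seq[String] = data.map(d => show(d))
}
```

Under these assumptions, passedAsAny holds "1.0" and "null" while passedAsDouble holds "1.0" and "0.0", which mirrors the two println variants in the question.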
How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
The internal binary representation doesn't use Scala types at all. Once array data is decoded, it is represented as an Array[Any], and the elements are coerced to the declared type with a simple asInstanceOf.
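The coercion step can be imitated in plain Scala to see why the null survives it. This is only a sketch: CoercionDemo and its fields are hypothetical names, and the Array[Any] is built by hand rather than decoded by Spark.

```scala
object CoercionDemo {
  // Stand-in for decoded array data: untyped elements, possibly null.
  val decoded: Array[Any] = Array[Any](1.0, null)

  // Coercion to the declared type. Because of erasure, no element is
  // inspected or converted at this point.
  val coerced: Seq[Double] = decoded.toSeq.asInstanceOf[Seq[Double]]

  // Viewing the same sequence as Seq[Any] shows the null is still there.
  val stillNull: Seq[Boolean] = coerced.asInstanceOf[Seq[Any]].map(_ == null)

  // Using an element as a Double forces unboxing; null becomes 0.0.
  val unboxed: Seq[Double] = coerced.map(d => d + 0.0)
}
```

Once an element has been unboxed to 0.0, a comparison such as d == null can no longer detect the original null, which is consistent with the two false lines in the question's output.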
Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls?
In general, if values are nullable, you should use an external type that is nullable as well, or Option. Unfortunately, only the first option is applicable for UDFs.
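A sketch of the filtering logic with a nullable external type, kept free of Spark so it can be run on its own. NullSafeFilter and dropNulls are hypothetical names chosen for this example.

```scala
object NullSafeFilter {
  // Accepts boxed doubles, which can represent null, and returns only the
  // values that are actually present, converted to Scala Doubles.
  def dropNulls(data: Seq[java.lang.Double]): Seq[Double] =
    data.filter(_ != null).map(_.doubleValue)
}
```

Unlike filtering on 0.0, this keeps genuine zeros in the data. In a Spark job the function would presumably be wrapped along the lines of udf((data: Seq[java.lang.Double]) => NullSafeFilter.dropNulls(data)); the question reports that this input type works in a UDF.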