Spark DataFrame nulls to Dataset
When importing data from an MS SQL database, there is the potential for null values. In Spark, DataFrames are able to handle the null values. But when I try to convert the DataFrame to a strongly typed Dataset, I receive encoder errors.
Here's a simple example:
case class optionTest(var a: Option[Int], var b: Option[Int])

object testObject {
  def main(args: Array[String]): Unit = {
    import spark.implicits._

    val df2 = Seq((1, 3), (3, Option(null)))
      .toDF("a", "b")
      .as[optionTest]

    df2.show()
  }
}
Here is the error for this case:
java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"
What is the recommended approach to handle nullable values when creating a Dataset from a DataFrame?
The problem is that your DataFrame doesn't match your case class.
Your first pair is an (Int, Int), and your second is an (Int, Option[Null]).
The easy thing to note is that if you want to represent an Option[Int], the value will be either Some(3), for example, or None for an absent value.
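As a plain-Scala illustration (no Spark needed; the object name is invented for this sketch), present and absent values look like this:

```scala
// Minimal pure-Scala sketch of Option[Int]: Some wraps a present value,
// None marks an absent one.
object OptionRepr {
  def main(args: Array[String]): Unit = {
    val present: Option[Int] = Some(3) // a value that exists
    val absent: Option[Int]  = None    // an absent value
    println(present.getOrElse(0))      // prints 3
    println(absent.getOrElse(0))       // prints 0, the fallback
  }
}
```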
The tricky thing to note is that in Scala, Int is a subclass of AnyVal, while nullable references, which should be almost nonexistent in the Scala code you write, are on the AnyRef side of the Scala object hierarchy.
Because you have a bunch of objects that are all over the Scala object model, Spark has to treat your data as Any, the superclass of everything. There is no encoder that can handle that.
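You can watch this widening happen in plain Scala, without Spark at all (the object name here is made up for the sketch):

```scala
// With elements (Int, Int) and (Int, Option[Null]), the compiler can only
// unify the second components as Any, so Spark sees a field of type Any
// and finds no encoder for it.
object WidenedToAny {
  def main(args: Array[String]): Unit = {
    val mixed = Seq((1, 3), (3, Option(null)))
    val widened: Seq[(Int, Any)] = mixed // the common supertype of the pairs
    println(widened.map(_._2))
    // Incidentally, Option(null) evaluates to None:
    println(Option(null) == None) // prints true
  }
}
```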
So with all that said, your data would have to look like this:
val df2 = Seq((Some(1), Some(3)), (Some(3), None))
As a side note, your case class should look like this:
case class OptionTest(a: Option[Int], b: Option[Int])
If you want to use Option you have to use it for all records. You should also use None instead of Option(null):
Seq((1, Some(3)), (3, None)).toDF("a", "b").as[optionTest]
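For contrast, once every record uses Option consistently, the element type is uniform, as this plain-Scala sketch (object name invented) shows:

```scala
// With Option used in every record, the sequence has the single element type
// (Option[Int], Option[Int]), which Spark's product encoder can handle.
object UniformOptions {
  def main(args: Array[String]): Unit = {
    val rows: Seq[(Option[Int], Option[Int])] =
      Seq((Some(1), Some(3)), (Some(3), None))
    rows.foreach { case (a, b) =>
      println(s"a=${a.getOrElse(-1)} b=${b.getOrElse(-1)}")
    }
  }
}
```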