
Spark DataFrame nulls to Dataset

When importing data from a MS SQL database, there is the potential for null values. In Spark, DataFrames are able to handle null values, but when I try to convert the DataFrame to a strongly typed Dataset, I receive encoder errors.

Here's a simple example:

import org.apache.spark.sql.SparkSession

case class optionTest(var a: Option[Int], var b: Option[Int])

object testObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Fails: the Seq mixes (Int, Int) with (Int, Option[Null]),
    // so its element type is inferred as (Int, Any).
    val df2 = Seq((1, 3), (3, Option(null)))
                 .toDF("a", "b")
                 .as[optionTest]

    df2.show()
  }
}

Here is the error for this case:

java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"

What is the recommended approach to handle nullable values when creating a Dataset from a DataFrame?

The problem is that your DataFrame doesn't match your case class.

Your first pair is an (Int, Int), and your second is an (Int, Option[Null]).

The easy thing to note is that if you want to represent an Option[Int], the value will be either Some(3), for example, or None for an absent value.

The tricky thing to note is that in Scala, Int is a subclass of AnyVal, while nullable references, which should be almost nonexistent in the Scala code you write, are on the AnyRef side of the Scala object hierarchy.

Because your data mixes values from both sides of the Scala object model, Spark has to treat it as Any, the superclass of everything. There is no encoder that can handle that.
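You can see this from the type the compiler infers for the mixed Seq; a minimal sketch (the rows name is illustrative, not from the original post):

// Int and Option[Null] share no common supertype below Any,
// so the element type is inferred as (Int, Any):
val rows = Seq((1, 3), (3, Option(null)))  // rows: Seq[(Int, Any)]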

So with all that said, your data would have to look like this:

val df2 = Seq((Some(1), Some(3)), (Some(3), None))

As a side note, your case class should look like this:

case class OptionTest(a: Option[Int], b: Option[Int])
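With both changes in place, a minimal end-to-end sketch (assuming a SparkSession is in scope as spark) looks like this:

import spark.implicits._

val ds = Seq((Some(1), Some(3)), (Some(3), None))
  .toDF("a", "b")
  .as[OptionTest]

ds.collect()  // Array(OptionTest(Some(1),Some(3)), OptionTest(Some(3),None))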

If you want to use Option you have to use it for all records. You should also use None instead of Option(null):

Seq((1, Some(3)), (3, None)).toDF("a", "b").as[optionTest]
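For the original MS SQL import, the same idea applies: nullable database columns arrive in the DataFrame as nulls, and Spark maps them to None when converting to the Dataset. A hedged sketch, using boxed java.lang.Integer to stand in for a nullable source column (illustrative, not from the original answer):

// java.lang.Integer can hold null, mimicking a nullable SQL column.
val fromDb = Seq[(Integer, Integer)]((1, 3), (3, null))
  .toDF("a", "b")
  .as[optionTest]

fromDb.collect()  // Array(optionTest(Some(1),Some(3)), optionTest(Some(3),None))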
