
Spark DataFrame nulls to Dataset

When importing data from a MS SQL database, there is the potential for null values. In Spark, DataFrames are able to handle null values, but when I try to convert the DataFrame to a strongly typed Dataset, I receive encoder errors.

Here's a simple example:

import org.apache.spark.sql.SparkSession

case class optionTest(var a: Option[Int], var b: Option[Int])

object testObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Fails: the Seq mixes (Int, Int) with (Int, Option[Null]),
    // so its element type is inferred as (Int, Any).
    val df2 = Seq((1, 3), (3, Option(null)))
                 .toDF("a", "b")
                 .as[optionTest]

    df2.show()
  }
}

Here is the error for this case:

java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"

What is the recommended approach to handle nullable values when creating a Dataset from a DataFrame?

The problem is that your DataFrame doesn't match your case class.

Your first pair is an (Int, Int), and your second is an (Int, Option[Null]).

The easy thing to note is that if you want to represent an Option[Int], the value will be either Some(3), for example, or None for an absent value.

The tricky thing to note is that in Scala, Int is a subclass of AnyVal, while nullable references, which should be almost nonexistent in the Scala code you write, are on the AnyRef side of the Scala object hierarchy.

Because your data mixes values from both sides of the Scala object model, Spark has to treat it as Any, the superclass of everything. There is no encoder that can handle that.
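You can see this from the type the compiler infers for the mixed Seq; a minimal sketch (the rows name is illustrative, not from the original post):

// Int and Option[Null] share no common supertype below Any,
// so the element type is inferred as (Int, Any):
val rows = Seq((1, 3), (3, Option(null)))  // rows: Seq[(Int, Any)]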

So with all that said, your data would have to look like this:

val df2 = Seq((Some(1), Some(3)), (Some(3), None))

As a side note, your case class should look like this:

case class OptionTest(a: Option[Int], b: Option[Int])
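With both changes in place, a minimal end-to-end sketch (assuming a SparkSession is in scope as spark) looks like this:

import spark.implicits._

val ds = Seq((Some(1), Some(3)), (Some(3), None))
  .toDF("a", "b")
  .as[OptionTest]

ds.collect()  // Array(OptionTest(Some(1),Some(3)), OptionTest(Some(3),None))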

If you want to use Option you have to use it for all records. You should also use None instead of Option(null):

Seq((1, Some(3)), (3, None)).toDF("a", "b").as[optionTest]
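For the original MS SQL import, the same idea applies: nullable database columns arrive in the DataFrame as nulls, and Spark maps them to None when converting to the Dataset. A hedged sketch, using boxed java.lang.Integer to stand in for a nullable source column (illustrative, not from the original answer):

// java.lang.Integer can hold null, mimicking a nullable SQL column.
val fromDb = Seq[(Integer, Integer)]((1, 3), (3, null))
  .toDF("a", "b")
  .as[optionTest]

fromDb.collect()  // Array(optionTest(Some(1),Some(3)), optionTest(Some(3),None))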
