
Spark DataFrame nulls to Dataset

When importing data from an MS SQL database, some columns may contain null values. Spark DataFrames handle nulls without trouble, but when I try to convert the DataFrame to a strongly typed Dataset, I get encoder errors.

Here's a simple example:

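import org.apache.spark.sql.SparkSession

case class optionTest(var a: Option[Int], var b: Option[Int])

object testObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("optionTest").getOrCreate()
    import spark.implicits._

    // The second tuple puts an Option where the first has a plain Int,
    // which is what triggers the encoder error below.
    val df2 = Seq((1, 3), (3, Option(null)))
                 .toDF("a", "b")
                 .as[optionTest]

    df2.show()
  }
}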

Here is the error for this case:

No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"
java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"

What is the recommended approach to handle nullable values when creating a Dataset from a DataFrame?

The problem is that your data doesn't match your case class.

Your first pair is an (Int, Int), and your second is an (Int, Option[Null]).

The easy thing to note is that if you want to represent an Option[Int], the value must be either Some(3), for example, or None for an absent value.

The tricky thing to note is that in Scala, Int is a subclass of AnyVal, while nullable references, which should be almost nonexistent in the Scala code you write, are on the AnyRef side of the Scala type hierarchy.

Because the second elements of your tuples sit on different sides of that hierarchy, the Scala compiler infers their only common supertype, Any, so the sequence is typed as Seq[(Int, Any)]. Spark has no encoder for Any, which is why the error points at field _2 of scala.Tuple2.
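To make the inference step concrete, here is a small sketch in plain Scala 2 with no Spark involved. The object and value names are made up for illustration. The unannotated `mixed` is the sequence from the question, and assigning it to `widened` only documents the type the compiler should infer for it.

```scala
object InferenceSketch {
  // Option(null) evaluates to None, but its static type is Option[Null].
  val fromNull: Option[Null] = Option(null)

  // Second slot holds an Int in one tuple and an Option[Null] in the other.
  // Their only common supertype is Any, so this is a Seq[(Int, Any)].
  val mixed = Seq((1, 3), (3, Option(null)))
  val widened: Seq[(Int, Any)] = mixed

  // Using Option in every record keeps both slots typed as Option[Int].
  val consistent: Seq[(Option[Int], Option[Int])] =
    Seq((Some(1), Some(3)), (Some(3), None))
}
```

The first sequence is the one Spark cannot encode; the second has a concrete element type that an encoder can be derived for.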

So with all that said, your data would have to look like this:

val df2 = Seq((Some(1), Some(3)), (Some(3), None))

As a side note, your case class should look like this:

case class OptionTest(a: Option[Int], b: Option[Int])
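Putting the two pieces together, a minimal end-to-end sketch could look like the following. The object name, the `build` helper, and the local `SparkSession` settings are assumptions made for the example and would differ in a real job. The explicit type annotation on `rows` is there only to keep both columns typed as Option[Int].

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at the top level so Spark can derive an encoder for it.
case class OptionTest(a: Option[Int], b: Option[Int])

object OptionDatasetSketch {
  // Builds the Dataset from data that uses Option in every record.
  def build(spark: SparkSession): Dataset[OptionTest] = {
    import spark.implicits._
    val rows: Seq[(Option[Int], Option[Int])] =
      Seq((Some(1), Some(3)), (Some(3), None))
    rows.toDF("a", "b").as[OptionTest]
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("option-dataset-sketch")
      .getOrCreate()
    try build(spark).show()
    finally spark.stop()
  }
}
```

In the resulting Dataset the None in the second record is stored as a null in column b.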

If you want to use Option for a column, you have to use it in every record. You should also write None instead of Option(null):

Seq((1, Some(3)), (3, None)).toDF("a", "b").as[optionTest]
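For the situation in the question, where the DataFrame comes from a database and already contains nulls, the same case class shape should work as long as the nullable columns are declared as Option fields. The sketch below stands in for the database read with a hand-built DataFrame; the schema, the `NullableRow` class, and the helper name are all hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical target type: every nullable column is an Option field.
case class NullableRow(a: Option[Int], b: Option[Int])

object NullableColumnsSketch {
  def fromNullableFrame(spark: SparkSession): Dataset[NullableRow] = {
    import spark.implicits._

    // Stand-in for a table read from a database: both columns are nullable.
    val schema = StructType(Seq(
      StructField("a", IntegerType, nullable = true),
      StructField("b", IntegerType, nullable = true)
    ))
    val rows = Seq(Row(1, 3), Row(3, null))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    // Null cells are decoded as None, non-null cells as Some(value).
    df.as[NullableRow]
  }
}
```

With this shape the null in column b of the second row is read back as None, so no placeholder value is needed for missing data.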
