
How to mock a Spark Scala DataFrame with a nested case-class schema?

How do I create/mock a Spark Scala DataFrame with a case class nested inside the top level?

root
 |-- _id: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- animalCaseClass: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- gender: string (nullable = true)

I am currently unit testing a function that outputs a DataFrame with the above schema. To check equality, I used toDF(), which unfortunately produces a schema with nullable = false for "_id" in the mocked DataFrame, making the test fail (note that the actual output of the function has nullable = true for everything).
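One way around the nullability mismatch in a test (a sketch; the helper name `asNullable` is my own, assuming only the standard spark-sql types API) is to normalize nullability on both schemas before comparing, instead of trying to force toDF() to emit nullable primitives:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively mark every field (including fields of nested structs) as
// nullable, so two schemas can be compared while ignoring nullability.
def asNullable(schema: StructType): StructType =
  StructType(schema.fields.map { f =>
    f.dataType match {
      case st: StructType => StructField(f.name, asNullable(st), nullable = true)
      case dt             => StructField(f.name, dt, nullable = true)
    }
  })
```

In the test the comparison would then be `assert(asNullable(actual.schema) == asNullable(expected.schema))`, so a nullable = false "_id" no longer causes a failure.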

I also tried creating the mocked DataFrame a different way, which led to errors: https://pastebin.com/WtxtgMJA

Here is what I tried in this approach:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
val animalSchema = Encoders.product[AnimalCaseClass].schema

val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)

val data = Seq(Row(12345L, "Asia", AnimalCaseClass("tiger", "male")), Row(12346L, "Asia", AnimalCaseClass("tigress", "female")))

val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

I had to use this approach to make nullable true for the fields where toDF() sets nullable to false by default.

How can I create a DataFrame with the same schema as the output of the mocked function, declaring values that can also be a case class?

From the logs you provided, you can see:

Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String>, ... 3 more fields

which means you are trying to insert an object of type AnimalCaseClass into a column whose data type is struct<name:String,gender:String>, and this was caused by your use of the Row object: inside a Row, values are taken as-is as external types and are not converted by an encoder.
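If you want to keep the Row-based approach from the question, an alternative fix (a sketch under the same schema assumptions) is to nest Row values instead of case-class instances, so each value matches the declared struct type:

```scala
import org.apache.spark.sql.Row

// A nested Row("tiger", "male") is a valid external value for
// struct<name:string,gender:string>, whereas a case-class instance
// inside a Row is not.
val data = Seq(
  Row(12345L, "Asia", Row("tiger", "male")),
  Row(12346L, "Asia", Row("tigress", "female"))
)
```

With this `data`, the original `spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))` call from the question works unchanged.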

import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.SparkSession

case class AnimalCaseClass(name: String, gender: String)

object Test extends App {

  val conf: SparkConf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[2]")
  conf.set("spark.sql.test", "")
  conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")

  val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

  // ** The relevant part **
  import org.apache.spark.sql.Encoders
  val animalSchema = Encoders.product[AnimalCaseClass].schema

  val expectedSchema: StructType = StructType(Seq(
    StructField("_id", LongType, true),
    StructField("continent", StringType, true),
    StructField("animalCaseClass", animalSchema, true)
  ))

  import spark.implicits._
  val data = Seq((12345L, "Asia", AnimalCaseClass("tiger", "male")), (12346L, "Asia", AnimalCaseClass("tigress", "female"))).toDF()

  val expected = spark.createDataFrame(data.rdd, expectedSchema)

  expected.printSchema()

  expected.show()

  spark.stop()
}
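For the equality check itself in the unit test, a minimal helper (my own sketch, not a Spark API; libraries such as spark-testing-base offer more robust comparisons) could be:

```scala
import org.apache.spark.sql.DataFrame

// Compare the schema first, then the collected rows as a Set so that
// row order does not affect the result. Collecting the whole DataFrame
// is fine for small test fixtures like the one above.
def assertDataFramesEqual(actual: DataFrame, expected: DataFrame): Unit = {
  assert(actual.schema == expected.schema, "schemas differ")
  assert(actual.collect().toSet == expected.collect().toSet, "rows differ")
}
```

Calling `assertDataFramesEqual(actualOutput, expected)` then verifies both the nested struct schema (including nullability) and the data.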

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please cite this site or the original source. For any questions, contact: yoyou2525@163.com.

 
© 2020-2024 STACKOOM.COM