
How to mock a Spark Scala DataFrame with a nested case-class schema?

How do I create/mock a Spark Scala DataFrame with a case class nested inside the top level?

root
 |-- _id: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- animalCaseClass: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- gender: string (nullable = true)

I am currently unit testing a function that outputs a DataFrame with the above schema. To check equality, I used toDF(), which unfortunately produces a schema with nullable = false for "_id" in the mocked DataFrame, making the test fail (note that the actual output of the function has nullable = true for everything).
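One way around the nullability mismatch in a test (a sketch; the helper name `asNullable` is my own, assuming only the standard spark-sql types API) is to normalize nullability on both schemas before comparing, instead of trying to force toDF() to emit nullable primitives:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively mark every field (including fields of nested structs) as
// nullable, so two schemas can be compared while ignoring nullability.
def asNullable(schema: StructType): StructType =
  StructType(schema.fields.map { f =>
    f.dataType match {
      case st: StructType => StructField(f.name, asNullable(st), nullable = true)
      case dt             => StructField(f.name, dt, nullable = true)
    }
  })
```

In the test the comparison would then be `assert(asNullable(actual.schema) == asNullable(expected.schema))`, so a nullable = false "_id" no longer causes a failure.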

I also tried creating the mocked DataFrame a different way, which led to errors: https://pastebin.com/WtxtgMJA

Here is what I tried in this approach:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
val animalSchema = Encoders.product[AnimalCaseClass].schema

val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)

val data = Seq(Row(12345L, "Asia", AnimalCaseClass("tiger", "male")), Row(12346L, "Asia", AnimalCaseClass("tigress", "female")))

val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

I had to use this approach to make nullable true for the fields where toDF() sets nullable to false by default.

How can I create a DataFrame with the same schema as the output of the mocked function, declaring values that can also be a case class?

From the logs you provided, you can see:

Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String>, ... 3 more fields

which means you are trying to insert an object of type AnimalCaseClass into a column whose data type is struct<name:String,gender:String>, and this was caused by your use of the Row object: inside a Row, values are taken as-is as external types and are not converted by an encoder.
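If you want to keep the Row-based approach from the question, an alternative fix (a sketch under the same schema assumptions) is to nest Row values instead of case-class instances, so each value matches the declared struct type:

```scala
import org.apache.spark.sql.Row

// A nested Row("tiger", "male") is a valid external value for
// struct<name:string,gender:string>, whereas a case-class instance
// inside a Row is not.
val data = Seq(
  Row(12345L, "Asia", Row("tiger", "male")),
  Row(12346L, "Asia", Row("tigress", "female"))
)
```

With this `data`, the original `spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))` call from the question works unchanged.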

import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.SparkSession

case class AnimalCaseClass(name: String, gender: String)

object Test extends App {

  val conf: SparkConf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[2]")
  conf.set("spark.sql.test", "")
  conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")

  val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

  // ** The relevant part **
  import org.apache.spark.sql.Encoders
  val animalSchema = Encoders.product[AnimalCaseClass].schema

  val expectedSchema: StructType = StructType(Seq(
    StructField("_id", LongType, true),
    StructField("continent", StringType, true),
    StructField("animalCaseClass", animalSchema, true)
  ))

  import spark.implicits._
  val data = Seq((12345L, "Asia", AnimalCaseClass("tiger", "male")), (12346L, "Asia", AnimalCaseClass("tigress", "female"))).toDF()

  val expected = spark.createDataFrame(data.rdd, expectedSchema)

  expected.printSchema()

  expected.show()

  spark.stop()
}
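For the equality check itself in the unit test, a minimal helper (my own sketch, not a Spark API; libraries such as spark-testing-base offer more robust comparisons) could be:

```scala
import org.apache.spark.sql.DataFrame

// Compare the schema first, then the collected rows as a Set so that
// row order does not affect the result. Collecting the whole DataFrame
// is fine for small test fixtures like the one above.
def assertDataFramesEqual(actual: DataFrame, expected: DataFrame): Unit = {
  assert(actual.schema == expected.schema, "schemas differ")
  assert(actual.collect().toSet == expected.collect().toSet, "rows differ")
}
```

Calling `assertDataFramesEqual(actualOutput, expected)` then verifies both the nested struct schema (including nullability) and the data.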

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please cite this site or the original source. For any questions, contact: yoyou2525@163.com.

 
© 2020-2024 STACKOOM.COM