How do I create/mock a Spark Scala dataframe with a case class nested inside the top level?
root
|-- _id: long (nullable = true)
|-- continent: string (nullable = true)
|-- animalCaseClass: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- gender: string (nullable = true)
I am currently unit testing a function which outputs a dataframe in the above schema. To check equality, I used toDF(), which unfortunately gives a schema with nullable = false for "_id" in the mocked dataframe, thus making the test fail (note that the "actual" output from the function has nullable = true for everything).
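The nullability mismatch can be reproduced in isolation: for primitive-backed Scala types such as Long, toDF() infers nullable = false, while String columns come out nullable = true. A minimal sketch (column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object NullabilityDemo extends App {
  val spark = SparkSession.builder()
    .appName("NullabilityDemo")
    .master("local[2]")
    .getOrCreate()
  import spark.implicits._

  // toDF() infers nullable = false for primitive-backed columns such as Long,
  // which is why the mocked dataframe's schema differs from the real output.
  val df = Seq((12345L, "Asia")).toDF("_id", "continent")
  df.printSchema()
  // root
  //  |-- _id: long (nullable = false)
  //  |-- continent: string (nullable = true)

  spark.stop()
}
```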
I also tried creating the mocked dataframe a different way, which led to errors: https://pastebin.com/WtxtgMJA
Here is what I tried in this approach:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val animalSchema = Encoders.product[AnimalCaseClass].schema

val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)

val data = Seq(
  Row(12345L, "Asia", AnimalCaseClass("tiger", "male")),
  Row(12346L, "Asia", AnimalCaseClass("tigress", "female"))
)

val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
I had to use this approach to make nullable true for the fields where toDF() sets nullable = false by default.
How can I create a dataframe with the same schema as the output of the mocked function, with values that can also include a case class?
From the logs you provided, you can see that
Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String,,... 3 more fields>
which means you are trying to pass an instance of AnimalCaseClass where Spark expects an external value for the struct<name:string,gender:string> datatype. This was caused by wrapping the case class inside a Row object: a Row is not processed by an encoder, so the nested case class is never converted. Using a tuple together with toDF() lets the product encoder serialize the nested case class, and the desired schema can then be re-applied:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

case class AnimalCaseClass(name: String, gender: String)

object Test extends App {
  val conf: SparkConf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[2]")
  conf.set("spark.sql.test", "")
  conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")

  val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

  // ** The relevant part **
  import org.apache.spark.sql.Encoders
  val animalSchema = Encoders.product[AnimalCaseClass].schema

  val expectedSchema: StructType = StructType(Seq(
    StructField("_id", LongType, true),
    StructField("continent", StringType, true),
    StructField("animalCaseClass", animalSchema, true)
  ))

  import spark.implicits._
  // Use tuples instead of Row so the product encoder serializes the nested case class.
  val data = Seq(
    (12345L, "Asia", AnimalCaseClass("tiger", "male")),
    (12346L, "Asia", AnimalCaseClass("tigress", "female"))
  ).toDF()

  // Re-apply the schema so every field is nullable = true, matching the real output.
  val expected = spark.createDataFrame(data.rdd, expectedSchema)

  expected.printSchema()
  expected.show()

  spark.stop()
}
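With the expected dataframe built this way, the unit test can assert on both schema and contents. A minimal sketch of such a check; `assertDataFrameEquals` is a hypothetical helper name, not a library API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical test helper: compares schema (including nullability) and
// row contents, ignoring row order.
def assertDataFrameEquals(actual: DataFrame, expected: DataFrame): Unit = {
  // Schemas must match exactly, including nullable on every field.
  assert(actual.schema == expected.schema,
    s"Schema mismatch:\n${actual.schema.treeString}\nvs\n${expected.schema.treeString}")
  // Compare rows order-independently by collecting both sides.
  assert(actual.collect().toSet == expected.collect().toSet, "Row contents differ")
}
```

Collecting both sides is fine for small test fixtures; for large dataframes a join- or except-based comparison avoids pulling everything to the driver.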