
Spark Dataframe: Representing Schema of MapType with non-homogeneous data types in StructType values

I am trying to create a StructType schema to pass to the from_json API in order to parse a column stored as a JSON string. The JSON data contains a Map that has String keys and values of type struct, but the schema of each struct depends on the key.

Consider this JSON example, where the "data" column is a Map with keys "name" and "address", and the schema of each value is different:

{
  "data": {
    "name": {
      "first": "john"
    },
    "address": {
      "street": "wall",
      "zip": 10000
    }
  }
}

For key "name", the struct value has a single member field "first". For key "address", the struct value has two member fields "street" and "zip".

Can the "data" column be represented as a MapType[StringType, StructType] in a Spark dataframe?

  1. Does Spark handle a Map[String, Struct] where the structs are non-homogeneous?
  2. If yes, please share an example of a StructType schema representing a dataframe with schema MapType[String, StructType] where the value StructType is non-homogeneous. (A sketch of how a plain MapType schema is declared follows below.)
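
For reference, a MapType schema is declared with a single value type; here is a minimal sketch (using the "name" struct as the value type, purely for illustration):

import org.apache.spark.sql.types._

// MapType takes exactly one valueType, so every value in the map is expected
// to share this schema -- which is the crux of the question above.
val mapSchema = MapType(
  StringType,
  StructType(Seq(StructField("first", StringType)))  // fits "name", but not "address"
)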

EDIT: To add another example of such data, a Map[String, Struct] where the Struct does not have the same schema across the values of the Map, consider the following:

case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)
val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)

As you can see, the values of the map differ in their schema - Address, Name and Employee are all different case classes and their member fields are also different.

You can think of this kind of data as coming from a JSON file, where a map is allowed to have an arbitrary type of value for each key and there is no restriction that all values be of the same type. In my case the values will all be structs, but the schema of each struct depends on the map key.

You can read your JSON column and infer its schema dynamically. Spark infers "data" as a struct with one field per key (rather than a map), so each key's value keeps its own schema:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._


// Sample data: two JSON strings; only the first has an "occupation" key
val df = sc.parallelize(Seq(
  """{"data":{"name":{"first":"john"},"address":{"street":"wall","zip":10000},"occupation":{"id":"12345","role":"Software Engineer"}}}""",
  """{"data":{"name":{"first":"john"},"address":{"street":"wall","zip":10000}}}"""
)).toDF("my_json_column")

// Infer the schema from the JSON strings themselves
val rows = df.select("my_json_column").as[String]
val schema = spark.read.json(rows).schema

// Parse the JSON string into a struct column using the inferred schema
val newDF = df.withColumn("obj", from_json(col("my_json_column"), schema))

newDF.printSchema
// root
//  |-- my_json_column: string (nullable = true)
//  |-- obj: struct (nullable = true)
//  |    |-- data: struct (nullable = true)
//  |    |    |-- address: struct (nullable = true)
//  |    |    |    |-- street: string (nullable = true)
//  |    |    |    |-- zip: long (nullable = true)
//  |    |    |-- name: struct (nullable = true)
//  |    |    |    |-- first: string (nullable = true)
//  |    |    |-- occupation: struct (nullable = true)
//  |    |    |    |-- id: string (nullable = true)
//  |    |    |    |-- role: string (nullable = true)

newDF.select("obj.data", "obj.data.occupation.id").show(false)

Output

+---------------------------------------------------+-----+
|data                                               |id   |
+---------------------------------------------------+-----+
|{{wall, 10000}, {john}, {12345, Software Engineer}}|12345|
|{{wall, 10000}, {john}, null}                      |null |
+---------------------------------------------------+-----+
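
If an actual MapType column is still required, one option (a sketch, assuming the map keys are known up front) is to keep each value as a JSON string, since a MapType needs a single value type across all keys:

import org.apache.spark.sql.functions.{col, lit, map, to_json}

// Build a map<string,string> column: keys are the known field names,
// values are the corresponding structs re-serialized as JSON strings.
// Missing keys (e.g. "occupation" in the second row) become null values.
val withMap = newDF.withColumn(
  "data_map",
  map(
    lit("name"),       to_json(col("obj.data.name")),
    lit("address"),    to_json(col("obj.data.address")),
    lit("occupation"), to_json(col("obj.data.occupation"))
  )
)

This keeps the map's values heterogeneous in content while still satisfying Spark's requirement of a single value type.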
