I am trying to create a StructType schema to pass to the from_json API in order to parse a column stored as a JSON string. The JSON data contains a Map that has String keys and values of type struct, but the schema of each struct depends on the key.
Consider this JSON example, where the "data" column is a Map with keys "name" and "address", and the schema of each value is different:
{
"data": {
"name": {
"first": "john"
},
"address": {
"street": "wall",
"zip": 10000
}
}
}
For key "name", the struct value has a single member field "first". For key "address", the struct value has two member fields "street" and "zip".
Can the "data" column be represented as a MapType(StringType, StructType) in a Spark DataFrame, i.e. a MapType<String, StructType> where the StructType is non-homogeneous?

EDIT: To add another example of such data, a Map[String, Struct] where the Struct does not have the same schema across the values of the Map, consider the following:
case class Address(street: String, zip: Int)
case class Name(first: String)
case class Employee(id: String, role: String)
val map = Map(
  "address" -> Address("wall", 10000),
  "name" -> Name("john"),
  "occupation" -> Employee("12345", "Software Engineer")
)
As you can see, the values of the map differ in their schema: Address, Name and Employee are all different case classes, and their member fields also differ.
You can think of this kind of data coming from a JSON file where a map is allowed to have any arbitrary type of value across keys, and there is no restriction on all the values being of the same type. In my case I will have values that are all structs but the schema of each struct depends on the map key.
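For context (not from the original post): the crux of the problem is that Spark's MapType constructor accepts a single valueType, so every value in the map must share that one schema. There is no way to declare a different schema per key. A minimal sketch of the constraint, using a hypothetical nameSchema:

```scala
import org.apache.spark.sql.types._

// A hypothetical per-key schema for the "name" value.
val nameSchema = StructType(Seq(StructField("first", StringType)))

// MapType takes exactly one valueType: every value in the map must
// conform to nameSchema, so the "address" value (street, zip) cannot
// be given its own schema inside the same map.
val mapType = MapType(StringType, nameSchema)
```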
You can read your JSON column as a Dataset[String] and let Spark infer the schema dynamically:
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._
val df = sc.parallelize(Seq(
  """{"data":{"name":{"first":"john"},"address":{"street":"wall","zip":10000},"occupation":{"id":"12345","role":"Software Engineer"}}}""",
  """{"data":{"name":{"first":"john"},"address":{"street":"wall","zip":10000}}}"""
)).toDF("my_json_column")
val rows = df.select("my_json_column").as[String]
val schema = spark.read.json(rows).schema
// Transforming your String to Struct
val newDF = df.withColumn("obj", from_json(col("my_json_column"), schema))
newDF.printSchema
// root
// |-- my_json_column: string (nullable = true)
// |-- obj: struct (nullable = true)
// | |-- data: struct (nullable = true)
// | | |-- address: struct (nullable = true)
// | | | |-- street: string (nullable = true)
// | | | |-- zip: long (nullable = true)
// | | |-- name: struct (nullable = true)
// | | | |-- first: string (nullable = true)
// | | |-- occupation: struct (nullable = true)
// | | | |-- id: string (nullable = true)
// | | | |-- role: string (nullable = true)
newDF.select("obj.data", "obj.data.occupation.id").show(false)
Output
+---------------------------------------------------+-----+
|data |id |
+---------------------------------------------------+-----+
|{{wall, 10000}, {john}, {12345, Software Engineer}}|12345|
|{{wall, 10000}, {john}, null} |null |
+---------------------------------------------------+-----+
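If you still want a MapType representation, one possible workaround (a sketch, not part of the answer above) is to declare the map values as plain JSON strings and then parse each value with the schema that matches its key. This relies on from_json emitting a nested object's raw JSON text when the declared value type is StringType; the schemas and column names below are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[1]").appName("map-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  """{"data":{"name":{"first":"john"},"address":{"street":"wall","zip":10000}}}"""
).toDF("my_json_column")

// Declare "data" as a map from string keys to raw JSON strings.
// Each nested object is kept as its JSON text instead of being parsed.
val mapSchema = StructType(Seq(
  StructField("data", MapType(StringType, StringType))
))

val withMap = df.withColumn("obj", from_json(col("my_json_column"), mapSchema))

// Parse the "address" value with its own schema.
val addressSchema = StructType(Seq(
  StructField("street", StringType),
  StructField("zip", LongType)
))
val parsed = withMap
  .withColumn("address", from_json(col("obj.data").getItem("address"), addressSchema))

parsed.select("address.street", "address.zip").show(false)
```

The trade-off is that the map column itself stays untyped (string values), and you only get a typed struct after a second from_json call per key.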