
Parse nested JSON stringified column in Spark Streaming SQL

I followed the Spark Streaming guide and was able to get a SQL context of my JSON data using sqlContext.read.json(rdd). The problem is that one of the JSON fields is itself a JSON string that I would like to have parsed.

Is there a way to accomplish this within spark sql, or would it be easier to use ObjectMapper to parse the string and join to the rest of the data?

To clarify, one of the values in the JSON is a string containing JSON data with the inner quotes escaped. I'm looking for a way to tell the parser to treat that value as stringified JSON.

Example JSON

{ 
  "key": "val",
  "jsonString": "{ \"too\": \"bad\" }",
  "jsonObj": { "ok": "great" }
}

How SQLContext Parses it

root
 |-- key: string (nullable = true)
 |-- jsonString: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)

How I would like it

root
 |-- key: string (nullable = true)
 |-- jsonString: struct (nullable = true)
 |    |-- too: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)

You can use the from_json function to parse the column of a Dataset:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for createDataset and the $"..." column syntax

val stringified = spark.createDataset(Seq("{ \"too\": \"bad\" }", "{ \"too\": \"sad\" }"))
stringified.printSchema()

val structified = stringified.withColumn("value", from_json($"value", StructType(Seq(StructField("too", StringType, false)))))
structified.printSchema()

Which converts the value column from a string to a struct:

root
 |-- value: string (nullable = true)

root
 |-- value: struct (nullable = true)
 |    |-- too: string (nullable = false)
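
Applying the same idea to the original example envelope: let Spark infer the outer schema, then replace the stringified column with a parsed struct. This is a sketch, assuming Spark 2.2+ (where spark.read.json accepts a Dataset[String]); the column and field names are taken from the example JSON above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("nested-json").getOrCreate()
import spark.implicits._

// The example envelope, with the inner JSON escaped inside a string value
val raw = Seq("""{ "key": "val", "jsonString": "{ \"too\": \"bad\" }", "jsonObj": { "ok": "great" } }""").toDS()

// Spark infers the outer schema; jsonString comes back as a plain string
val df = spark.read.json(raw)

// Replace the string column with a struct parsed by from_json
val innerSchema = StructType(Seq(StructField("too", StringType, nullable = true)))
val parsed = df.withColumn("jsonString", from_json($"jsonString", innerSchema))
parsed.printSchema()
```

This produces the schema from the question, with both jsonString and jsonObj as structs.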

Older RDD API Approach (see accepted answer for DataFrame API)

I ended up using Jackson to parse the json envelope, then again to parse the inner escaped string.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

val parsedRDD = rdd.mapPartitions { iter =>
  // Build the Jackson mapper once per partition rather than once per record
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)

  iter.map { x =>
    // Parse the outer envelope
    val envelopeMap = mapper.readValue[Map[String, Any]](x)

    // Parse the inner escaped jsonString value (default to "{}" so a
    // missing key doesn't blow up the parser)
    val event = mapper.readValue[Map[String, Any]](
      envelopeMap.getOrElse("jsonString", "{}").asInstanceOf[String])

    // Replace the stringified value with the parsed Map
    val parsed = envelopeMap.updated("jsonString", event)

    // Serialize the entire map back to a JSON string
    mapper.writeValueAsString(parsed)
  }
}

val df = sqlContext.read.json(parsedRDD)

Now parsedRDD contains valid JSON and the DataFrame properly infers the entire schema.

I think there must be a way to avoid serializing to JSON and parsing it again, but so far I don't see any sqlContext APIs that operate on RDD[Map[String, Any]].
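
A middle ground that avoids both the Jackson round-trip and a full schema is get_json_object, which evaluates a JSONPath expression against a string column. A sketch, assuming Spark 1.6+ and the same example data as above (spark.read.json on a Dataset[String] additionally needs Spark 2.2+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("json-path").getOrCreate()
import spark.implicits._

val raw = Seq("""{ "key": "val", "jsonString": "{ \"too\": \"bad\" }" }""").toDS()
val df = spark.read.json(raw)

// Pull a single field out of the stringified column with a JSONPath
// expression; no schema is needed, but the result is always a string
val withToo = df.withColumn("too", get_json_object($"jsonString", "$.too"))
withToo.select("key", "too").show()
```

This is convenient when you only need one or two fields from the stringified JSON; for the full nested schema, from_json (accepted answer) is the better fit.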

Obviously,

"jsonString": "{ \\"too\\": \\"bad\\" }"

is not valid JSON data. Fix it, and make sure the entire string is a valid JSON structure.

The JSON you provided is wrong, so I fixed it and will give you an example.

Let's take the JSON below: {"key": "val","jsonString": {"too": "bad"},"jsonObj": {"ok": "great"}}

The Spark SQL JSON parser will read nested JSON as well; frankly, if that were not supported it would be incomplete, since almost all real-world JSON is nested.

As for how to access it, you select using dot notation: jsonString.too or jsonObj.ok.

Below is an example:

scala> val df1 = sqlContext.read.json("/Users/srini/workspace/splunk_spark/file3.json").toDF
df1: org.apache.spark.sql.DataFrame = [jsonObj: struct<ok:string>, jsonString: struct<too:string>, key: string]

scala> df1.show
+-------+----------+---+
|jsonObj|jsonString|key|
+-------+----------+---+
|[great]|     [bad]|val|
+-------+----------+---+


scala> df1.select("jsonString.too");
res12: org.apache.spark.sql.DataFrame = [too: string]

scala> df1.select("jsonString.too").show
+---+
|too|
+---+
|bad|
+---+


scala> df1.select("jsonObj.ok").show
+-----+
|   ok|
+-----+
|great|
+-----+

Hope this helps; reply back if you need any more info. It's just parent node, dot, child node. That's it.
