I have a JSON string in a DataFrame column of type String, and I want to convert it to a map. The catch is that I don't know the exact schema of the JSON in advance, since the key names can vary.
Basically, my JSON column looks like:
{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}
I want this to eventually look like:
{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"},
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"},
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}
so that I can calculate mean of all prices when type="DYN".
In other words, reading the JSON data using this:
val testJsonData = spark.read.json("file:///data/json_example")
gives me the following schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
However, I'd like to end up with the much simpler:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
What transformation can I use on the data to be able to end up with the above schema?
Please let me know the easiest way to do this. Thanks in advance!
Your requirement is a bit complex, but if I've understood your question correctly, the following should work as a solution.
You already have the input DataFrame with this schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
The first step is to flatten the DataFrame so that each array ends up in its own column (declared as a var, since it is reassigned below):
var tt = testJsonData.select($"outerkey.*")
Its schema would then be:
root
|-- innerkey_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
Since you want the key name inside each struct, first collect the column names:
val schema = tt.schema.fieldNames
so schema contains innerkey_1, innerkey_2 and innerkey_3.
The tricky part is adding the key names inside the struct columns, which takes two for loops. (Note that exploding several array columns one after another produces a cross product of rows; that is harmless here because each array holds a single element.)
import org.apache.spark.sql.functions._
// first loop: replace each single-element array column with its struct element
for(column <- schema){
  tt = tt.withColumn(column, explode($"${column}"))
}
// second loop: prepend the key name to each struct
for(column <- schema){
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}
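As an aside, the two loops above can be collapsed into a single foldLeft, which also avoids the mutable var; a sketch, assuming the same testJsonData and spark session as above:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Build the same result in one pass: for each inner key, explode its
// single-element array and then wrap the resulting struct with its key name.
val flattened = testJsonData.select($"outerkey.*")
val tagged = flattened.schema.fieldNames.foldLeft(flattened) { (df, c) =>
  df.withColumn(c, explode($"$c"))
    .withColumn(c, struct(lit(c).as("keyname"), $"$c.*"))
}
```

This is purely a style choice; the two explicit loops produce the same DataFrame.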
Finally, tt has the key names added inside the struct columns:
root
|-- innerkey_1: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_2: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_3: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
The final step is to combine all of the struct columns back into a single array column, the opposite of the first step:
val temp = tt.select(array(schema.map(col): _*).as("outerkey"))
temp's schema is now the one you asked for:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
and
temp.toJSON.foreach(println)
prints the desired JSON data:
{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}