I have a JSON string in a DataFrame column of type String, and I want to convert it to a map. The catch is that I don't know the exact schema of the JSON in advance, since the key names can vary.
Basically, my JSON column looks like:
{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}
I want this to eventually look like:
{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"},
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"},
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}
so that I can calculate mean of all prices when type="DYN".
In other words, reading the JSON data using this:
val testJsonData = spark.read.json("file:///data/json_example")
gives me the following schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
However, I'd like to end up with the much simpler:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
What transformation can I use on the data to be able to end up with the above schema?
Please let me know the easiest way to do this. Thanks in advance!
Your requirement is a bit complex, but if I've understood your question correctly, the following should work as a solution.
You already have the input DataFrame with this schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
The first step is to flatten the DataFrame so that each array ends up in its own column (declared as a var, since it is reassigned below):
var tt = testJsonData.select($"outerkey.*")
Its schema would then be:
root
|-- innerkey_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
Since you want the key name inside each struct, first collect the column names:
val schema = tt.schema.fieldNames
so schema contains innerkey_1, innerkey_2 and innerkey_3.
The tricky part is adding the key names inside the struct columns, which takes two for loops. (Note that exploding several array columns one after another produces a cross product of rows; that is harmless here because each array holds a single element.)
import org.apache.spark.sql.functions._
// first loop: replace each single-element array column with its struct element
for(column <- schema){
  tt = tt.withColumn(column, explode($"${column}"))
}
// second loop: prepend the key name to each struct
for(column <- schema){
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}
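As an aside, the two loops above can be collapsed into a single foldLeft, which also avoids the mutable var; a sketch, assuming the same testJsonData and spark session as above:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Build the same result in one pass: for each inner key, explode its
// single-element array and then wrap the resulting struct with its key name.
val flattened = testJsonData.select($"outerkey.*")
val tagged = flattened.schema.fieldNames.foldLeft(flattened) { (df, c) =>
  df.withColumn(c, explode($"$c"))
    .withColumn(c, struct(lit(c).as("keyname"), $"$c.*"))
}
```

This is purely a style choice; the two explicit loops produce the same DataFrame.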
Finally, tt has the key names added inside the struct columns:
root
|-- innerkey_1: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_2: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_3: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
The final step is to combine all of the struct columns back into a single array column, the opposite of the first step:
val temp = tt.select(array(schema.map(col): _*).as("outerkey"))
temp's schema is now the one you asked for:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
and
temp.toJSON.foreach(println)
prints the desired JSON data:
{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}