I have a large number of JSON files obtained from a 3rd party. They all share the same schema, except that when a nested element is empty, it is represented as an empty array.
First example:
{
  ....
  "survey_data":
  {
    "2": { "question": "....", "answer_id": "....", .... },
    "3": { "question": "....", "answer_id": "....", .... }
  }
}
So this is valid JSON; the "survey_data" element is a struct type, but with a fairly complicated nested structure (with more sub-elements than in this simplified example).
However, when survey_data has no nested elements, it is represented as an empty array:
{
  ....
  "survey_data": []
}
which is obviously schema-incompatible, but I cannot change this since the data comes from a 3rd party.
When I load these JSON files into Spark as a single dataframe, Spark infers the type of survey_data as string and escapes all the characters:
"survey_data":"{\"2\":{\"question\": ...
This is obviously not good for me. I see two approaches to dealing with this:
Can anybody give me a hint toward a solution to this problem?
I think this should work; I did it a long time ago.
If you have a JSON file whose schema you are happy with, preferably a small one, you can use its schema to read all the other JSON files:
// Infer the schema once from a known-good file...
val jsonWithSchema = spark.read.json("PATH_TO_JSON_WITH_RIGHT_SCHEMA")
// ...then apply that schema when reading all the data files,
// so the empty-array files no longer derail schema inference.
val df = spark.read.schema(jsonWithSchema.schema).json("PATH_TO_DATAFILES")
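Alternatively, if you would rather fix the data itself, you could normalize the files before handing them to Spark, rewriting the empty-array case to an empty object so every file matches the struct schema. A minimal sketch of that idea (shown in Python for brevity; the same approach works from Scala, and the function name and the assumption that survey_data is a top-level key are mine):

```python
import json

def normalize_survey_data(raw: str) -> str:
    """Rewrite the 3rd-party quirk: a "survey_data" value that is an
    empty array becomes an empty object, so the field is consistently
    a struct across all files. Assumes survey_data is a top-level key."""
    doc = json.loads(raw)
    if doc.get("survey_data") == []:
        doc["survey_data"] = {}
    return json.dumps(doc)
```

You could run this once over the files on disk, or apply it to the raw JSON strings in Spark and then parse the normalized strings instead of the original files.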