
Processing JSON in Spark - different schemas in different files

I have a large number of JSON files obtained from a 3rd party. They all share the same schema, except that when a nested element is empty, it is represented as an empty array.

1st example

{
....
"survey_data":
    {
        "2": { "question":"....", "answer_id":"....", .... },
        "3": { "question":"....", "answer_id":"....", .... },
    }
 }

So this is valid JSON; the "survey_data" element is a struct type, but with a fairly complicated nested structure (with more sub-elements than in this simplified example).

However, when survey_data has no nested elements, it is represented as an empty array:

{
....
"survey_data": []
 }

which is obviously incompatible schema-wise, but I cannot change this since the data comes from a 3rd party.

When I load these JSON files in Spark as a single DataFrame, Spark infers the survey_data type as string and escapes all the characters:

"survey_data":"{\"2\":{\"question\": ...

This is obviously not good for me. I see two approaches to deal with this:

  1. somehow pre-process the files as pure text and remove the [] characters?
  2. use Spark to remove the array characters (see the sketch after this list), or tell Spark that the column should be a struct type?
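For the second approach, here is a minimal sketch of patching the raw text inside Spark before parsing it as JSON. It assumes each JSON document sits on a single line (the usual line-delimited JSON layout) and uses placeholder paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read the raw files as plain text, one JSON document per line.
val raw = spark.read.textFile("PATH_TO_DATAFILES")

// Replace the empty-array representation with null so that the struct
// shape of survey_data can be inferred from the files that do contain data.
val patched = raw.map(_.replaceAll("\"survey_data\"\\s*:\\s*\\[\\]", "\"survey_data\": null"))

// Parse the patched text as JSON; survey_data should now come back as a struct.
val df = spark.read.json(patched)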

Can anybody hint at a solution to this problem?

I think this should work; I did it a long time ago.

If you have a JSON file with a schema you are happy with, preferably a small one, you can use its schema to read all the other JSON files:

// Infer the schema from a sample file that has the full, correct structure
val jsonWithSchema = spark.read.json("PATH_TO_JSON_WITH_RIGHT_SCHEMA")
// Reuse that schema when reading all the data files
val df = spark.read.schema(jsonWithSchema.schema).json("PATH_TO_DATAFILES")
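As a small follow-up, the inferred schema can also be serialized to JSON and restored later, so the sample file does not have to be re-read on every run. This is just a sketch; where you store the JSON string is up to you:

import org.apache.spark.sql.types.{DataType, StructType}

// Serialize the inferred schema to its JSON representation.
val schemaJson = jsonWithSchema.schema.json
// ... persist schemaJson somewhere, then restore it in a later run:
val restoredSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df2 = spark.read.schema(restoredSchema).json("PATH_TO_DATAFILES")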
