
Parsing multiple JSON schemas with Spark

I need to collect a few key pieces of information from a large number of somewhat complex nested JSON messages that are evolving over time. Each message refers to the same type of event, but the messages are generated by several producers and come in two schemas (and likely more in the future). The key information in each message is similar, but the mapping to those fields depends on the message type.

I can't share the actual data but here is an example:

Message A
- header:
|- attribute1
|- attribute2
- typeA:
|- typeAStruct1:
||- property1
|- typeAStruct2:
||- property2


Message B
- attribute1
- attribute2
- contents:
|- message:
||- TypeB:
|||- property1
|||- TypeBStruct:
||||- property2
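
For reference, the two layouts could be written as Spark StructTypes along these lines (a sketch only; the field types are assumptions, since the real data isn't shown):

```python
from pyspark.sql.types import StructType, StructField, StringType

# Assumed schema for Message A (StringType is a guess for every leaf field)
schema_a = StructType([
    StructField("header", StructType([
        StructField("attribute1", StringType()),
        StructField("attribute2", StringType()),
    ])),
    StructField("typeA", StructType([
        StructField("typeAStruct1", StructType([
            StructField("property1", StringType()),
        ])),
        StructField("typeAStruct2", StructType([
            StructField("property2", StringType()),
        ])),
    ])),
])

# Assumed schema for Message B
schema_b = StructType([
    StructField("attribute1", StringType()),
    StructField("attribute2", StringType()),
    StructField("contents", StructType([
        StructField("message", StructType([
            StructField("TypeB", StructType([
                StructField("property1", StringType()),
                StructField("TypeBStruct", StructType([
                    StructField("property2", StringType()),
                ])),
            ])),
        ])),
    ])),
])
```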

I want to produce a table of data which looks something like this regardless of message type:

| MessageSchema | Property1 | Property2 |
| :------------ | :-------- | :-------- |
| MessageA      | A1        | A2        |
| MessageB      | B1        | B2        |
| MessageA      | A3        | A4        |
| MessageB      | B3        | B4        |

My current strategy is to read the data with schema A and union it with the data read with schema B. Then I can filter out the nulls that result from parsing a type A message with the B schema and vice versa. This seems very inefficient, especially once a third or fourth schema emerges. I would like to parse each message correctly on the first pass and apply the correct schema from the start.
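
In code, that strategy looks roughly like this (a sketch assuming the StructTypes above and a hypothetical input path `messages/`):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Parse every file twice, once per schema; the mismatched schema yields nulls
df_a = (spark.read.schema(schema_a).json("messages/")
        .select(F.lit("MessageA").alias("MessageSchema"),
                F.col("typeA.typeAStruct1.property1").alias("Property1"),
                F.col("typeA.typeAStruct2.property2").alias("Property2")))

df_b = (spark.read.schema(schema_b).json("messages/")
        .select(F.lit("MessageB").alias("MessageSchema"),
                F.col("contents.message.TypeB.property1").alias("Property1"),
                F.col("contents.message.TypeB.TypeBStruct.property2").alias("Property2")))

# Filter out rows that came from parsing with the wrong schema
# (assumes a genuine message always has a non-null Property1)
result = df_a.union(df_b).filter(F.col("Property1").isNotNull())
```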

As I see it, there is only one way:

  • For each message type, create an 'adapter' that builds a DataFrame from the input and transforms it to the common-schema DataFrame
  • Then union the outputs of the adapters (see the sketch below)

Obviously, if you change the 'common' schema, you will need to adjust your 'adapters' as well.
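
A minimal sketch of the adapter approach, assuming the schemas and path from earlier; routing on a non-null type-specific struct is an assumption about how the message types can be told apart:

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def adapt_message_a(spark: SparkSession) -> DataFrame:
    """Adapter: parse with schema A and map to the common schema."""
    return (spark.read.schema(schema_a).json("messages/")
            # Assumption: a non-null 'typeA' struct identifies a type A message
            .filter(F.col("typeA").isNotNull())
            .select(F.lit("MessageA").alias("MessageSchema"),
                    F.col("typeA.typeAStruct1.property1").alias("Property1"),
                    F.col("typeA.typeAStruct2.property2").alias("Property2")))

def adapt_message_b(spark: SparkSession) -> DataFrame:
    """Adapter: parse with schema B and map to the common schema."""
    return (spark.read.schema(schema_b).json("messages/")
            # Assumption: a non-null 'contents' struct identifies a type B message
            .filter(F.col("contents").isNotNull())
            .select(F.lit("MessageB").alias("MessageSchema"),
                    F.col("contents.message.TypeB.property1").alias("Property1"),
                    F.col("contents.message.TypeB.TypeBStruct.property2").alias("Property2")))

# Union the adapter outputs into the common-schema table
result = adapt_message_a(spark).unionByName(adapt_message_b(spark))
```

Each adapter encapsulates its own parsing and mapping, so when a third schema appears you only write one more adapter and add it to the union.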
