
Parsing multiple JSON schemas with Spark

I need to collect a few key pieces of information from a large number of somewhat complex nested JSON messages that are evolving over time. Each message refers to the same type of event, but the messages are generated by several producers and come in two schemas (and likely more in the future). The key information in each message is similar, but the mapping to those fields depends on the message type.

I can't share the actual data but here is an example:

Message A
- header:
|- attribute1
|- attribute2
- typeA:
|- typeAStruct1:
||- property1
|- typeAStruct2:
||- property2


Message B
- attribute1
- attribute2
- contents:
|- message:
||- TypeB:
|||- property1
|||- TypeBStruct:
||||- property2
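
For reference, the two layouts could be written as Spark StructTypes along these lines (a sketch only; the field types are assumptions, since the real data isn't shown):

```python
from pyspark.sql.types import StructType, StructField, StringType

# Assumed schema for Message A (StringType is a guess for every leaf field)
schema_a = StructType([
    StructField("header", StructType([
        StructField("attribute1", StringType()),
        StructField("attribute2", StringType()),
    ])),
    StructField("typeA", StructType([
        StructField("typeAStruct1", StructType([
            StructField("property1", StringType()),
        ])),
        StructField("typeAStruct2", StructType([
            StructField("property2", StringType()),
        ])),
    ])),
])

# Assumed schema for Message B
schema_b = StructType([
    StructField("attribute1", StringType()),
    StructField("attribute2", StringType()),
    StructField("contents", StructType([
        StructField("message", StructType([
            StructField("TypeB", StructType([
                StructField("property1", StringType()),
                StructField("TypeBStruct", StructType([
                    StructField("property2", StringType()),
                ])),
            ])),
        ])),
    ])),
])
```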

I want to produce a table of data which looks something like this regardless of message type:

| MessageSchema | Property1 | Property2 |
| :------------ | :-------- | :-------- |
| MessageA      | A1        | A2        |
| MessageB      | B1        | B2        |
| MessageA      | A3        | A4        |
| MessageB      | B3        | B4        |

My current strategy is to read the data with schema A and union it with the data read with schema B. Then I can filter out the nulls that result from parsing a type A message with the B schema and vice versa. This seems very inefficient, especially once a third or fourth schema emerges. I would like to parse each message correctly on the first pass and apply the correct schema from the start.
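
In code, that strategy looks roughly like this (a sketch assuming the StructTypes above and a hypothetical input path `messages/`):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Parse every file twice, once per schema; the mismatched schema yields nulls
df_a = (spark.read.schema(schema_a).json("messages/")
        .select(F.lit("MessageA").alias("MessageSchema"),
                F.col("typeA.typeAStruct1.property1").alias("Property1"),
                F.col("typeA.typeAStruct2.property2").alias("Property2")))

df_b = (spark.read.schema(schema_b).json("messages/")
        .select(F.lit("MessageB").alias("MessageSchema"),
                F.col("contents.message.TypeB.property1").alias("Property1"),
                F.col("contents.message.TypeB.TypeBStruct.property2").alias("Property2")))

# Filter out rows that came from parsing with the wrong schema
# (assumes a genuine message always has a non-null Property1)
result = df_a.union(df_b).filter(F.col("Property1").isNotNull())
```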

As I see it, there is only one way:

  • For each message type, create an 'adapter' that builds a DataFrame from the input and transforms it to the common-schema DataFrame
  • Then union the outputs of the adapters (see the sketch below)

Obviously, if you change the 'common' schema, you will need to adjust your 'adapters' as well.
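
A minimal sketch of the adapter approach, assuming the schemas and path from earlier; routing on a non-null type-specific struct is an assumption about how the message types can be told apart:

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def adapt_message_a(spark: SparkSession) -> DataFrame:
    """Adapter: parse with schema A and map to the common schema."""
    return (spark.read.schema(schema_a).json("messages/")
            # Assumption: a non-null 'typeA' struct identifies a type A message
            .filter(F.col("typeA").isNotNull())
            .select(F.lit("MessageA").alias("MessageSchema"),
                    F.col("typeA.typeAStruct1.property1").alias("Property1"),
                    F.col("typeA.typeAStruct2.property2").alias("Property2")))

def adapt_message_b(spark: SparkSession) -> DataFrame:
    """Adapter: parse with schema B and map to the common schema."""
    return (spark.read.schema(schema_b).json("messages/")
            # Assumption: a non-null 'contents' struct identifies a type B message
            .filter(F.col("contents").isNotNull())
            .select(F.lit("MessageB").alias("MessageSchema"),
                    F.col("contents.message.TypeB.property1").alias("Property1"),
                    F.col("contents.message.TypeB.TypeBStruct.property2").alias("Property2")))

# Union the adapter outputs into the common-schema table
result = adapt_message_a(spark).unionByName(adapt_message_b(spark))
```

Each adapter encapsulates its own parsing and mapping, so when a third schema appears you only write one more adapter and add it to the union.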
