
Transform JSON into Parquet using EMR/Spark

I have a huge number of JSON files that I need to transform into Parquet. They look something like this:

{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}

And I need to transform them into a Parquet file whose structure is this (nested properties are made top-level and get _ as a prefix):

foo=bar
_prop1=val1
_prop2=val2
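
Flattening a single document is not the hard part. A rough sketch with Jackson (which is already on Spark's classpath), using a hypothetical flattenDoc helper and assuming the nested object is always called props as in the example above, could look like this:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.databind.node.ObjectNode

// Hypothetical helper: lifts the fields of the nested "props" object to the
// top level, prefixing each with "_". Assumes every document is a JSON object.
def flattenDoc(json: String): String = {
  val mapper = new ObjectMapper() // for simplicity; a real job would reuse one mapper per partition
  val root = mapper.readTree(json).asInstanceOf[ObjectNode]
  val props = root.remove("props")
  if (props != null && props.isObject) {
    val it = props.fields()
    while (it.hasNext) {
      val e = it.next()
      root.set[JsonNode]("_" + e.getKey, e.getValue)
    }
  }
  mapper.writeValueAsString(root) // one flattened document as a single JSON string
}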

Now here's the catch: not all of the JSON documents have the same properties. So, if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must have all three properties (some of them will be null for some of the records).

I understand that Parquet needs a schema up front, so my current plan is:

  • Traverse all the JSON files
  • Infer a schema per document (using Kite, like this)
  • Merge all the schemas
  • Start writing the Parquet

This approach strikes me as very complicated, slow, and error-prone. I'm wondering if there's a better way to achieve this using Spark.

It turns out Spark already does this for you. When it reads JSON documents without a user-supplied schema, it infers the schema and merges it across all the documents. So in my case, something like this works:

import org.apache.spark.rdd.RDD

// Read one JSON document per file, then flatten each document so that the
// nested properties become top-level "_"-prefixed fields.
val flattenedJson: RDD[String] = sparkContext.wholeTextFiles("/file")
  .map { case (_, doc) => /* parse and flatten the JSON document here */ doc }

sqlContext
  .read
  .json(flattenedJson)   // schema is inferred and merged across all documents
  .write
  .parquet("destination")
