
Transform JSON into Parquet using EMR/Spark

I have a huge number of JSON files that I need to transform into Parquet. They look something like this:

{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}

And I need to transform them into a Parquet file whose structure is this (nested properties are made top-level and get _ as a prefix):

foo=bar
_prop1=val1
_prop2=val2
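
Flattening a single document is not the hard part. A rough sketch with Jackson (which is already on Spark's classpath), using a hypothetical flattenDoc helper and assuming the nested object is always called props as in the example above, could look like this:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.databind.node.ObjectNode

// Hypothetical helper: lifts the fields of the nested "props" object to the
// top level, prefixing each with "_". Assumes every document is a JSON object.
def flattenDoc(json: String): String = {
  val mapper = new ObjectMapper() // for simplicity; a real job would reuse one mapper per partition
  val root = mapper.readTree(json).asInstanceOf[ObjectNode]
  val props = root.remove("props")
  if (props != null && props.isObject) {
    val it = props.fields()
    while (it.hasNext) {
      val e = it.next()
      root.set[JsonNode]("_" + e.getKey, e.getValue)
    }
  }
  mapper.writeValueAsString(root) // one flattened document as a single JSON string
}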

Now here's the catch: not all of the JSON documents have the same properties. So, if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must have all three properties (some of them will be null for some of the records).

I understand that Parquet needs a schema up front, so my current plan is:

  • Traverse all the JSON files
  • Infer a schema per document (using Kite, like this)
  • Merge all the schemas
  • Start writing the Parquet

This approach strikes me as very complicated, slow, and error-prone. I'm wondering if there's a better way to achieve this using Spark.

It turns out Spark already does this for you. When it reads JSON documents without a user-supplied schema, it infers the schema and merges it across all the documents. So in my case, something like this works:

import org.apache.spark.rdd.RDD

// Read one JSON document per file, then flatten each document so that the
// nested properties become top-level "_"-prefixed fields.
val flattenedJson: RDD[String] = sparkContext.wholeTextFiles("/file")
  .map { case (_, doc) => /* parse and flatten the JSON document here */ doc }

sqlContext
  .read
  .json(flattenedJson)   // schema is inferred and merged across all documents
  .write
  .parquet("destination")
