
Spark: write Parquet from heterogeneous data

I have a dataset of heterogeneous JSONs that I can group by type and apply a StructType to, e.g. an RDD[(Type, JSON)] plus a Set[Type] containing all types present in the original RDD.

Now I want to write these JSONs into typed Parquet files, partitioned by type.

What I do currently is:

val getSchema: Type => StructType = ???
val buildRow: StructType => JSON => Row = ???

types.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(path)
}

It works, but very inefficiently, since it traverses the original RDD once per type - and I usually have tens or even hundreds of types.

Are there better solutions? I tried to union the DataFrames, but that fails because rows with different arities cannot be coalesced. Or, at the very least, can I release the resources held by a DataFrame by unpersisting it?

A few options come to mind (some might or might not be suitable for your requirements).

  1. If the different types don't need to be in separate files, you could use a schema like {type1col: StructType1, type2col: StructType2, ...} and write out Parquet files where only one of the struct columns is populated per row.
  2. Repartition the data so that all JSON objects of the same type end up in the same partition (plus some sort of secondary key), write out the partitioned data, then read it back in and filter by type (if it is written out partitioned, each type's data only needs to be loaded once). This only requires reading the data twice, but the shuffle could be expensive; see the sketch after this list.
  3. You could create a custom WriteBatch that operates on rows of (type, JSON), applies the transformation right before writing, and keeps one open file handle per type. If you require separate files, this is probably going to be the most efficient option, but it would require a decent amount of code to maintain and debug.
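
A minimal sketch of option 2, assuming the type and the JSON payload are both kept as strings and that stagingPath and typedPath are hypothetical output locations of your choosing:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

val spark: SparkSession = ???
import spark.implicits._

val rdd: RDD[(String, String)] = ???   // (type, raw JSON string)
val stagingPath = "/tmp/json-staging"  // hypothetical paths
val typedPath = "/tmp/typed-parquet"

// 1) Shuffle once so rows of the same type end up together, and lay the
//    raw JSON out on disk partitioned by type (one directory per type).
rdd.toDF("jsonType", "json")
  .repartition(col("jsonType"))
  .write
  .partitionBy("jsonType")
  .parquet(stagingPath)

// 2) For each type, read back only that type's directory, parse the raw
//    JSON with the type-specific schema, and write the typed Parquet.
types.foreach { jsonType =>
  val sparkSchema = getSchema(jsonType)
  spark.read.parquet(s"$stagingPath/jsonType=$jsonType")
    .select(from_json(col("json"), sparkSchema).as("parsed"))
    .select("parsed.*")
    .write
    .parquet(s"$typedPath/$jsonType")
}

The full dataset is shuffled and written once in the staging step, and each per-type pass then only touches that type's directory, which matches the "read the data 2x" cost mentioned in option 2.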

This will help speed up the Parquet writes. Since types is a Set[Type], the foreach loop runs sequentially, meaning the Parquet files are written one by one rather than in parallel. Since each Parquet write is independent of the others, there should be no problem with writing multiple files at once.

Using Scala parallel collections (see https://docs.scala-lang.org/overviews/parallel-collections/overview.html ) is the easiest way to achieve this. Try changing the types.foreach line to:

types.par.foreach { jsonType =>

Also cache the original RDD, if it's accessed multiple times.
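
Putting both suggestions together, a sketch based on the code from the question (the per-type output path is a hypothetical choice, since each type needs its own target directory):

rdd.cache()  // keep the original RDD in memory, since every iteration filters it

types.par.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  // each thread submits an independent Spark job, so the writes can overlap
  spark.createDataFrame(rows, sparkSchema).write.parquet(s"$path/$jsonType")
}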
