I have a dataset of heterogeneous JSONs that I can group by type and apply a StructType to, e.g. an RDD[(Type, JSON)]
plus a Set[Type]
containing all the types that occur in the original RDD.
Now I want to write these JSONs into typed Parquet files, partitioned by type.
What I do currently is:
val getSchema: Type => StructType = ???
val buildRow: StructType => JSON => Row = ???
types.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }  // match on the type component of the pair
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(path)
}
It works, but very inefficiently: the original RDD is traversed once per type, and I usually have tens or even hundreds of types.
Is there a better solution? I tried unioning the DataFrames, but that fails because rows of different arities cannot be coalesced. Or, at the least, can I release the resources held by a DataFrame by unpersisting it?
A few options come to mind (some might or might not be suitable for your requirements).
One option is to combine all types into a single wide schema, e.g.
{type1: StructType1, type2: StructType2, ...}
and write out Parquet files where only one of the struct columns is populated per row. This will help speed up the Parquet writes, since the whole dataset can go out in one pass.
Another observation: since types
is of type Set[Type]
, the foreach loop runs sequentially, meaning the Parquet files are written one by one rather than in parallel. And since each Parquet write is independent of the others, there should be no problem writing multiple files at once.
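The wide-schema option can be sketched as follows. This is only a sketch under my own assumptions: the column names ("json_type", "payload_*") are illustrative, and Type from the question is modelled as a plain String here.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build one wide schema: a discriminator column plus one nullable
// struct column per JSON type, so every row fits the same schema.
def wideSchema(types: Seq[String], getSchema: String => StructType): StructType =
  StructType(
    StructField("json_type", StringType, nullable = false) +:
      types.map(t => StructField(s"payload_$t", getSchema(t), nullable = true))
  )

// Wrap a typed Row so that only its own struct column is populated;
// all other payload columns stay null.
def wideRow(types: Seq[String], jsonType: String, typedRow: Row): Row =
  Row.fromSeq(jsonType +: types.map(t => if (t == jsonType) typedRow else null))
```

With these helpers the whole RDD can be converted and written in a single traversal, e.g. via write.partitionBy("json_type"), and readers of one type then select only its struct column.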
Using Scala parallel collections (refer: https://docs.scala-lang.org/overviews/parallel-collections/overview.html ) would be the easiest way to achieve this. Try changing the foreach line to:
types.par.foreach { jsonType =>
Also cache the original RDD if it is accessed multiple times.
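Putting both suggestions together might look like the sketch below. The names rdd, types, getSchema, buildRow, spark and path are taken from the question; the per-type output path layout is my assumption.

```scala
import org.apache.spark.storage.StorageLevel

// Cache the source RDD once so each per-type job reads from cache
// instead of recomputing the original lineage.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Submitting the independent writes from a parallel collection lets
// several Spark jobs run concurrently; SparkSession is thread-safe
// for job submission.
types.par.foreach { jsonType =>
  val sparkSchema = getSchema(jsonType)
  val rows = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema)
    .write.parquet(s"$path/type=$jsonType")  // illustrative layout
}

// Release the cached blocks once all writes have finished, which also
// addresses the resource-release question.
rdd.unpersist()
```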