Spark: write Parquet from heterogeneous data
I have a dataset with heterogeneous JSONs that I can group by type and apply a StructType to. E.g. an RDD[(Type, JSON)] and a Set[Type] containing all types from the original RDD.
Now I want to write these JSONs into typed Parquet files, partitioned by type.
What I currently do is:
val getSchema: Type => StructType = ???
val buildRow: StructType => JSON => Row = ???

types.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(path)
}
It works, but very inefficiently, as it needs to traverse the original RDD many times; I usually have tens or even hundreds of types.
Are there any better solutions? I tried to union the dataframes, but it fails because rows with different arities cannot be coalesced. Or maybe I can at least release the resources taken by a dataframe by unpersisting it?
A few options come to mind (some might or might not be suitable for your requirements).

One option is to use a single combined schema of the form {type1col: StructType1, type2: StructType2, etc} and write out parquet files where only one of the struct columns is populated. This will help in speeding up the parquet file writes, since the whole RDD can be converted and written in a single pass.
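A minimal sketch of that idea, reusing the question's getSchema/buildRow helpers (naming each struct column after the type's toString is just an assumption here):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

// One nullable struct column per type, e.g. {type1: StructType1, type2: StructType2, ...}
val typeList: Seq[Type] = types.toSeq
val combinedSchema = StructType(
  typeList.map(t => StructField(t.toString, getSchema(t), nullable = true)))

// Each row fills only the struct column matching its type; the other columns stay null
val rows: RDD[Row] = rdd.map { case (t, json) =>
  Row.fromSeq(typeList.map(x => if (x == t) buildRow(getSchema(t))(json) else null))
}

// Single pass over the RDD, single write
spark.createDataFrame(rows, combinedSchema).write.parquet(path)

The trade-off is that readers of the resulting files have to know which struct column to look at, and the schema grows with the number of types.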
Another option is to parallelize the writes themselves. Since types is of type Set[Type], the foreach loop will happen sequentially, meaning the parquet files are written one by one (not in parallel). And since each parquet file write is independent of the others, there should be no problem if multiple files are written at once.
Using Scala parallel collections (refer: https://docs.scala-lang.org/overviews/parallel-collections/overview.html) would be the easiest way to achieve it. Try changing the foreach line to:
types.par.foreach { jsonType =>
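For reference, the whole loop from the question might then look like this. Writing each type to its own sub-directory is an assumed output layout, and basePath is a hypothetical variable:

// Each iteration submits an independent Spark job; the parallel collection lets several run concurrently from the driver
types.par.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(s"$basePath/type=$jsonType")
}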
Also cache the original RDD if it is accessed multiple times.
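A small sketch of the caching part; the storage level is an assumption, pick whatever fits your memory budget:

import org.apache.spark.storage.StorageLevel

// Keep the parsed RDD around so each per-type filter pass reads from cache instead of recomputing
rdd.persist(StorageLevel.MEMORY_AND_DISK)

types.par.foreach { jsonType =>
  // ... filter, build rows, write parquet as above ...
}

// Release the cached blocks once all writes have finished
rdd.unpersist()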