
Spark: write Parquet from heterogeneous data

I have a dataset with heterogeneous JSONs that I can group by type and apply a StructType to. E.g. an RDD[(Type, JSON)] and a Set[Type] containing all types from the original RDD.

Now I want to write these JSONs into typed Parquet files, partitioned by type.

What I do currently is:

val getSchema: Type => StructType = ???
val buildRow: StructType => JSON => Row = ???

types.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType } // keep only this type's records
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(path)
}

It works, but very inefficiently, as it needs to traverse the original RDD many times - I usually have tens or even hundreds of types.

Are there any better solutions? I tried to union the DataFrames, but it fails because rows with different arities cannot be coalesced. Or, at the very least, can I release the resources taken by a DataFrame by unpersisting it?

A few options come to mind (some might or might not be suitable for your requirements).

  1. If the different types don't need to be in separate files, you could use a schema like {type1col: StructType1, type2col: StructType2, ...} and write out Parquet files where only one of the struct columns is populated (see the sketch after this list).
  2. Repartition the data so that all JSON objects of the same type land in the same partition (keyed by type plus some sort of secondary key), and write out the partitioned data. Then read it back in and filter by type (if it was written out partitioned, you would only need to load the data once). This only requires reading the data twice (but the shuffle could be expensive).
  3. You could create a custom WriteBatch that operates on rows of (type, JSON), applies the transformation right before writing, and keeps one open file handle for each type. If you require separate files, this is probably the most efficient option, but it would require a decent amount of code to maintain and debug.
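As an illustration of option 1, here is a minimal sketch that builds a single wide schema with one nullable struct column per type and writes everything in one pass. It reuses rdd, types, getSchema, buildRow, spark and path from the question; deriving the column names from Type.toString is an assumption made purely for this example.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

val typeList = types.toList
val schemaByType = typeList.map(t => t -> getSchema(t)).toMap

// One nullable struct column per type.
val combinedSchema = StructType(
  typeList.map(t => StructField(t.toString, schemaByType(t), nullable = true))
)

// Populate only the struct column matching each record's type; leave the rest null.
val rows = rdd.map { case (jsonType, json) =>
  Row.fromSeq(typeList.map(t => if (t == jsonType) buildRow(schemaByType(t))(json) else null))
}

// A single traversal of the RDD and a single Parquet write.
spark.createDataFrame(rows, combinedSchema).write.parquet(path)

Readers of the resulting files can then select the one struct column they care about and drop the rows where it is null.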

This will help speed up the Parquet file writes. Since types is of type Set[Type], the foreach loop runs sequentially, meaning the Parquet files are written one by one rather than in parallel. And since each Parquet file write is independent of the others, there should be no problem writing multiple files at once.

Using Scala parallel collections (see: https://docs.scala-lang.org/overviews/parallel-collections/overview.html) is the easiest way to achieve this. Try changing the types.foreach line to:

types.par.foreach { jsonType =>

Also, cache the original RDD if it is accessed multiple times.
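Putting the two suggestions together, a short sketch might look like the following. Writing each type to its own sub-path (path/jsonType) is an assumption made for the example, since concurrent writers cannot share a single output path.

rdd.cache()  // keep the source RDD in memory so each type's pass reuses it

types.par.foreach { jsonType =>
  val sparkSchema: StructType = getSchema(jsonType)
  val rows: RDD[Row] = rdd
    .filter { case (t, _) => t == jsonType }
    .map { case (_, json) => buildRow(sparkSchema)(json) }
  spark.createDataFrame(rows, sparkSchema).write.parquet(s"$path/$jsonType")
}

rdd.unpersist()  // release the cached RDD once all types have been written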
