
Repartition with Apache Spark

The problem: I am trying to repartition a dataset so that all rows that have the same number in a specified column of integers end up in the same partition.

What is working: when I use the 1.6 API (in Java) with RDDs, I use a hash partitioner and this works as expected. For example, if I print the modulo of each value of this column for each row, I get the same modulo within a given partition (I check the partitions by manually reading the content saved with saveAsHadoopFile).
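(For reference, a minimal Scala sketch of this RDD-era approach; the question used the Java API, but HashPartitioner behaves the same way, and saveAsTextFile stands in for saveAsHadoopFile here. All names, values and paths are placeholders.)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-hash-partition-sketch").setMaster("local[*]"))

// Key each record by the integer column, then hash-partition: rows whose keys
// have the same key.hashCode % numPartitions end up in the same partition.
val records = sc.parallelize(Seq((7, "a"), (10, "b"), (13, "c"), (4, "d")))
val partitioned = records.partitionBy(new HashPartitioner(3))

// One part-file per partition; inspecting them shows a single modulo class per file.
partitioned.saveAsTextFile("/tmp/rdd_partitioned_output")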

It is not working as expected with the latest API

But now I am trying to use the 2.0.1 API (in Scala) and Datasets, which have a repartition method that takes a number of partitions and a column, and to save this Dataset as a Parquet file. The results are not the same: if I look in the partitions, the rows are not hash-partitioned on this column.

To save a partitioned Dataset you can use either of the following (see the sketch after the list for a fuller example):

  • DataFrameWriter.partitionBy - available since Spark 1.6

     df.write.partitionBy("someColumn").format(...).save() 
  • DataFrameWriter.bucketBy - available since Spark 2.0

     df.write.bucketBy(numBuckets, "someColumn").format(...).saveAsTable(...) 
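For concreteness, a minimal Scala sketch of both approaches (the DataFrame, column name, output path and table name below are placeholders, not taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()
import spark.implicits._

// Placeholder data with an integer column to partition/bucket on
val df = spark.range(0, 100).withColumn("someColumn", ($"id" % 10).cast("int"))

// partitionBy: one output directory per distinct value of someColumn
df.write.partitionBy("someColumn").parquet("/tmp/partitioned_output")

// bucketBy: hash rows into a fixed number of buckets; requires saveAsTable
df.write.bucketBy(10, "someColumn").sortBy("someColumn").saveAsTable("bucketed_table")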

Using df.repartition($"someColumn").write.format(...).save should work as well, but the Dataset API doesn't use hashCode. It uses MurmurHash, so the results will be different from the results of HashPartitioner in the RDD API, and trivial checks (like the one you described) won't work.

import org.apache.spark.sql.functions.{hash, lit, udf}
import spark.implicits._

// Hash used by the RDD HashPartitioner: plain Java hashCode
val oldHashCode = udf((x: Long) => x.hashCode)

// Non-negative modulo, as in Utils.nonNegativeMod:
// https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/util/Utils.scala#L1596-L1599
val nonNegativeMod = udf((x: Int, mod: Int) => {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
})

val df = spark.range(0, 10)

// Partition assigned by the old hashCode vs. the Murmur3-based hash (functions.hash)
val oldPart = nonNegativeMod(oldHashCode($"id"), lit(3))
val newPart = nonNegativeMod(hash($"id"), lit(3))

df.select($"*", oldPart, newPart).show
+---+---------------+--------------------+
| id|UDF(UDF(id), 3)|UDF(hash(id, 42), 3)|
+---+---------------+--------------------+
|  0|              0|                   1|
|  1|              1|                   2|
|  2|              2|                   2|
|  3|              0|                   0|
|  4|              1|                   2|
|  5|              2|                   2|
|  6|              0|                   0|
|  7|              1|                   0|
|  8|              2|                   2|
|  9|              0|                   2|
+---+---------------+--------------------+

One possible gotcha is that the DataFrame writer can merge multiple small files to reduce cost, so data from different partitions can be put in a single file.
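To see what actually ended up in each file, one quick check (reusing df = spark.range(0, 10) from above; the output path is a placeholder) is to repartition by the column, write, and group the rows read back by input_file_name():

import org.apache.spark.sql.functions.{collect_set, input_file_name}

// Repartition by "id" (Murmur3-based hash partitioning) and write to Parquet.
df.repartition(3, $"id").write.parquet("/tmp/repartition_check")

// Read the files back and list which ids each Parquet file contains. As noted
// above, the writer may combine small files, so one file can hold several ids.
spark.read.parquet("/tmp/repartition_check")
  .groupBy(input_file_name().alias("file"))
  .agg(collect_set($"id").alias("idsInFile"))
  .show(truncate = false)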
