
Partitioning in Spark

I created an RDD by parallelizing the following array:

var arr: Array[(Int,Char)] = Array()
for (i <- 'a' to 'z') {arr = arr :+ (1,i)} // Key 1 has 25 elements
for (i <- List.range('a','c')) {arr = arr :+ (2,i)} // Key 2 has 2
for (i <- List.range('a','f')) {arr = arr :+ (3,i)} // Key 3 has 5
val rdd = sc.parallelize(arr,8)

I want to partition the RDD above so that every partition receives a different key and the partitions are nearly equal in size. The code below lets me partition the RDD by key:

import org.apache.spark.HashPartitioner
val prdd = rdd.partitionBy(new HashPartitioner(3))

The partitions created by the above code have the following sizes:

 scala> prdd.mapPartitions(iter=> Iterator(iter.length)).collect
 res43: Array[Int] = Array(25, 2, 5)

Is there a way to make the partitions from this RDD nearly equal in size? In the case above, for example, key 1 has the largest partition, with 25 elements. Could I instead get partition sizes like:

Array[Int] = Array(5, 5, 5, 5, 5, 2, 5)

I tried using a RangePartitioner on prdd above, but it didn't work.

The issue you're running into is inherent in your data.

  1. Your keys have a very imbalanced distribution.
  2. You want all keys to be grouped together.

There's really no way to get an even distribution given these two constraints! If you print the partition sizes when you first call parallelize, you'll see that the partitions are relatively balanced: sc.parallelize chunks the data evenly.
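For example, the same check the question uses, run on rdd before any partitioner is applied (the sizes in the comment are illustrative, not verified output):

rdd.mapPartitions(iter => Iterator(iter.length)).collect
// roughly even chunks across the 8 requested partitions,
// e.g. something like Array(4, 4, 4, 4, 4, 4, 4, 5)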

Spark partitioners provide a deterministic function from a key K to a partition index p. There's no way to have several partitions for the "1" key while preserving this function. Range partitioners are useful for maintaining order in an RDD, but they won't help here: for any given key, there can only be one partition you need to look in.
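To make that contract concrete, here is a minimal sketch of a custom Partitioner (the class name is mine; this is essentially what HashPartitioner does). getPartition is a pure function of the key, so every record with a given key lands in exactly one partition:

import org.apache.spark.Partitioner

// Minimal sketch: getPartition maps each key to exactly one index,
// so a single key can never be spread across several partitions.
class KeyModPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % parts
    if (h < 0) h + parts else h // hashCode may be negative
  }
}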

Are you partitioning so that you can do key/value RDD operations like join or reduceByKey later? If so, then you're out of luck. If not, then we can play some tricks with partitioning by key/value combinations rather than just the key, as in the sketch below!
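A sketch of that trick, assuming you only care about balanced sizes and not key co-location: hash on the whole (key, value) pair, so a heavy key like 1 gets spread across partitions. The partition count of 7 matches the sizes the question asks for, though the exact split hashing produces will differ:

import org.apache.spark.HashPartitioner

// Key by the whole (key, value) pair, so placement depends on the
// pair's hash rather than on the skewed key alone. Note this
// deliberately breaks the co-location that join/reduceByKey rely on.
val spread = rdd
  .map { case (k, v) => ((k, v), ()) }   // composite key, dummy value
  .partitionBy(new HashPartitioner(7))   // hash on the (k, v) pair
  .map { case ((k, v), _) => (k, v) }    // restore the original shape

spread.mapPartitions(iter => Iterator(iter.length)).collect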
