In pyspark, why does `limit` followed by `repartition` create exactly equal partition sizes?

According to the pyspark documentation, repartition is supposed to use hash partitioning, which would give slightly unequal partition sizes. However, I have found that by preceding it with limit, it will produce exactly equal partition sizes. This can be shown by running the following in a pyspark shell:

df = spark.createDataFrame([range(5)] * 100)  # 100 rows of 5 columns each

def count_part_size(part_iter):
    # yield the number of rows in this partition
    yield len(list(part_iter))

print(df.repartition(20).rdd.mapPartitions(count_part_size).collect())
# [4, 4, 4, 5, 4, 4, 5, 4, 5, 6, 6, 6, 7, 5, 5, 5, 5, 6, 5, 5]

print(df.limit(100).repartition(20).rdd.mapPartitions(count_part_size).collect())
# [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

If repartition is using a hash partitioner, why would it produce exactly equal partition sizes in this case? And if it is not using a hash partitioner, what kind of partitioner is it using?

By the way, I am using Python 2.7.15 and Spark 2.0.2.

There are four factors here:

  • If no partitioning expression is provided, repartition doesn't use HashPartitioning, or to be specific, it doesn't use it directly. Instead it uses RoundRobinPartitioning, which (as you can probably guess)

    Distributes elements evenly across output partitions, starting from a random partition.

    Internally, it generates a sequence of scala.Int on each partition, starting from a random point. Only these values are passed through HashPartitioner.

  • It works this way because Int hashCode is simply the identity - in other words

    ∀x ∈ Int: hashCode(x) = x

    (that's BTW the same behavior as CPython hash in the Scala Int range, -2147483648 to 2147483647; these hashes are simply not designed to be cryptographically secure). As a result, applying HashPartitioner to a series of consecutive Int values results in an actual round-robin assignment.

    So in such a case HashPartitioner works simply as a modulo operator (the Python sketch after this list illustrates the effect).

  • You apply LIMIT before repartition, so all values are shuffled to a single node first. Therefore only one sequence of Int values is used.

  • The number of partitions is a divisor of the size of the dataset, so the data can be distributed uniformly among the partitions.
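
To make the mechanism concrete, here is a minimal sketch in plain Python (no Spark required; the partition count, row count, and random starting offset are illustrative, chosen to match the example above):

import random

# For non-negative integers in the Scala Int range, CPython's hash is the
# identity, mirroring Scala's Int.hashCode (CPython maps -1 to -2 as a
# special case):
assert all(hash(x) == x for x in range(1000))

num_partitions = 20
num_rows = 100  # after limit(100) there is a single input partition

# RoundRobinPartitioning (sketched): assign consecutive Int keys starting
# from a random partition, then route each row via hash(key) % num_partitions
start = random.randrange(num_partitions)
sizes = [0] * num_partitions
for key in range(start, start + num_rows):
    sizes[hash(key) % num_partitions] += 1

print(sizes)
# [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
# exactly equal, because 100 consecutive keys hit each residue mod 20
# the same number of times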

Overall it is a combination of the intended behavior (each input partition should be distributed uniformly among output partitions), the properties of the pipeline (there is only one input partition), and the data (the dataset can be divided evenly among the partitions).
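
If this explanation is right, then any plan that leaves a single input partition whose row count is divisible by the target partition count should behave the same way. As an untested cross-check (using the df and count_part_size from the question), forcing a single input partition with coalesce(1) instead of limit should also yield exactly equal sizes:

print(df.coalesce(1).repartition(20).rdd.mapPartitions(count_part_size).collect())
# expected: [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]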
