In pyspark, why does `limit` followed by `repartition` create exactly equal partition sizes?
According to the pyspark documentation, `repartition` is supposed to use hash partitioning, which would give slightly unequal partition sizes. However, I have found that preceding it with `limit` produces exactly equal partition sizes. This can be shown by running the following in a pyspark shell:
df = spark.createDataFrame([range(5)] * 100)
def count_part_size(part_iter):
    yield len(list(part_iter))
print(df.repartition(20).rdd.mapPartitions(count_part_size).collect())
# [4, 4, 4, 5, 4, 4, 5, 4, 5, 6, 6, 6, 7, 5, 5, 5, 5, 6, 5, 5]
print(df.limit(100).repartition(20).rdd.mapPartitions(count_part_size).collect())
# [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
If `repartition` is using a hash partitioner, why would it produce exactly equal partition sizes in this case? And if it is not using a hash partitioner, what kind of partitioner is it using?
By the way, I am using Python version 2.7.15 and Spark version 2.0.2.
There are four factors here:
If no partitioning expression is provided, `repartition` doesn't use `HashPartitioning`, or to be specific, it doesn't use it directly. Instead it uses `RoundRobinPartitioning`, which (as you can probably guess)

Distributes elements evenly across output partitions, starting from a random partition.
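A rough pure-Python sketch of that behavior (the helper name is hypothetical; this only illustrates the idea, not Spark's actual Scala implementation):

```python
import random

def round_robin_assign(elements, num_partitions):
    """Deal elements out evenly across partitions,
    starting from a randomly chosen partition."""
    start = random.randrange(num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for i, elem in enumerate(elements):
        partitions[(start + i) % num_partitions].append(elem)
    return partitions

# 100 elements dealt into 20 partitions: every partition gets exactly 5
parts = round_robin_assign(range(100), 20)
print([len(p) for p in parts])
```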
Internally, it generates a sequence of `scala.Int` on each partition, starting from a random point. Only these values are passed through `HashPartitioner`.
It works this way because the `hashCode` of an `Int` is simply the identity; in other words,

∀x ∈ Int: hashCode(x) = x

(That is, by the way, the same behavior as CPython's `hash` in the Scala `Int` range, -2147483648 to 2147483647; these hashes are simply not designed to be cryptographically secure.) As a result, applying `HashPartitioner` to a series of consecutive `Int` values results in actual round-robin assignment.
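You can check the CPython side of that claim directly in a Python shell (with one well-known quirk: CPython reserves -1 as an internal error code, so `hash(-1)` is -2):

```python
# In CPython, the hash of a small int is the int itself,
# which covers the entire Scala Int range on a 64-bit build.
for x in [0, 1, 42, 2147483647, -2147483648]:
    if x != -1:
        assert hash(x) == x

# The lone exception in that range:
print(hash(-1))  # -2
```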
So in such a case, `HashPartitioner` works simply as a modulo operator.
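In other words, for a sequence of non-negative integer keys, hash-then-mod degenerates to plain modulo, cycling through the partitions in order (a simplification: Spark's `HashPartitioner` actually uses a non-negative mod, which agrees with Python's `%` for non-negative keys):

```python
num_partitions = 20

# Sequential integer keys land in partitions 0, 1, 2, ... in turn,
# because hash(i) == i for these values.
assignments = [hash(i) % num_partitions for i in range(10)]
print(assignments)  # same as [i % num_partitions for i in range(10)]
```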
You apply `LIMIT` before the repartition, so all values are shuffled to a single node first. Therefore only one sequence of `Int` values is used.
The number of partitions is a divisor of the size of the dataset (20 divides 100). Because of that, the data can be uniformly distributed among the partitions.
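The divisor condition is what makes the sizes exactly equal rather than merely near-equal; a small simulation of round-robin dealing (illustrative helper, not Spark code) shows the difference:

```python
def round_robin_sizes(n_elements, n_partitions, start=0):
    """Partition sizes when n_elements are dealt out round-robin."""
    sizes = [0] * n_partitions
    for i in range(n_elements):
        sizes[(start + i) % n_partitions] += 1
    return sizes

print(round_robin_sizes(100, 20))  # 20 divides 100: all sizes equal 5
print(round_robin_sizes(101, 20))  # otherwise one partition gets an extra row
```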
Overall, it is a combination of intended behavior (each input partition should be uniformly distributed among output partitions), properties of the pipeline (there is only one input partition), and the data (the dataset can be uniformly distributed).