
How can I keep the number of partitions unchanged when I use the window.partitionBy() function with Spark/Scala?

I have an RDD. The number of partitions of the result changes to 200 when I use a window. Can I keep the partitioning unchanged when I use a window?

This is my code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val result = rdd.toDF("values").withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values")))).rdd
println(result.getNumPartitions)  // prints 200, not 4

My input has 4 partitions; why does the result have 200 partitions?

I want my result to also have 4 partitions.

Is there any cleaner solution?

Note: As mentioned by @eliasah, it is not possible to avoid the repartition when using window functions with Spark.


  • Why does the result have 200 partitions?

Spark doc: the default value of spark.sql.shuffle.partitions, which configures the number of partitions to use when shuffling data for joins or aggregations, is 200.
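As an aside, here is a minimal sketch (assuming a SparkSession named spark, as you would have in spark-shell) of lowering that setting for the whole session before applying the window:

// Lower the shuffle partition count for this session so that window/aggregation
// shuffles produce 4 partitions instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "4")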

  • How can I repartition to 4?

You can use one of the following (a short sketch follows the doc excerpts below):

coalesce(4)

or

repartition(4)

Spark doc:

coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
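For example, a minimal sketch applying coalesce to the result RDD from the code above (coalesced is just an illustrative name):

// Bring the windowed result back down to 4 partitions.
// coalesce avoids a full shuffle; repartition(4) would shuffle all the data again.
val coalesced = result.coalesce(4)
println(coalesced.getNumPartitions)  // 4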

(I also added this answer to https://stackoverflow.com/a/44384638/3415409.)

I was just reading about controlling the number of partitions when using groupBy aggregation, from https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html, and it seems the same trick works with Window. In my code I'm defining a window like:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window \
    .partitionBy('colA', 'colB') \
    .orderBy('timeCol') \
    .rowsBetween(1, 1)

and then doing:

next_event = F.lead('timeCol', 1).over(windowSpec)

and creating a DataFrame via:

df2 = df.withColumn('next_event', next_event)

and indeed, it has 200 partitions. But if I do:

df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)

it has 10!
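Applied to the original Scala example from the question, the same trick might look like this (a sketch; repartitioned is just an illustrative name):

// Repartition by the window's partitioning column *before* applying the window,
// so the window can reuse that partitioning instead of shuffling into 200 partitions.
val repartitioned = rdd.toDF("values")
  .repartition(4, col("values"))
  .withColumn("csum", sum(col("values")).over(Window.partitionBy(col("values"))))
  .rdd
println(repartitioned.getNumPartitions)  // 4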
