
Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions to have:

from pyspark import SparkContext

sc = SparkContext()
sc.parallelize(range(0, 10), 4)  # 4 partitions; range replaces Python 2's xrange

How does the number of partitions I decide to partition my RDD into influence performance? And how does this depend on the number of cores my machine has?

The primary effect comes from specifying either too few partitions or far too many partitions.

Too few partitions: You will not utilize all of the cores available in the cluster.

Too many partitions: There will be excessive overhead in managing many small tasks.

Between the two, the first one is far more impactful on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
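A minimal sketch of how you might compare an RDD's partition count against the parallelism Spark reports for your setup; the RDD contents and the app name are illustrative, not part of the answer above:

from pyspark import SparkContext

# Illustrative check: compare this RDD's partition count with Spark's default parallelism
sc = SparkContext(appName="partition-count-check")
rdd = sc.parallelize(range(0, 10), 4)

print(rdd.getNumPartitions())  # partitions in this RDD (4 here)
print(sc.defaultParallelism)   # the parallelism Spark uses by default (roughly the available cores)

# Far fewer partitions than defaultParallelism leaves cores idle;
# tens of thousands of partitions makes task scheduling the bottleneck.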

To add to javadba's excellent answer, I recall the docs recommend setting your number of partitions to 3 or 4 times the number of CPU cores in your cluster so that the work gets distributed more evenly among the available CPU cores. Meaning, if you only have 1 partition per CPU core in the cluster, you will have to wait for the one longest-running task to complete; but if you had broken that down further, the workload would be more evenly balanced, with fast and slow running tasks evening out.
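As a hedged sketch of that rule of thumb (the multiplier 3 is just one point in the suggested 3-4x range, and the app name is made up):

from pyspark import SparkContext

sc = SparkContext(appName="partitions-from-cores")

# Roughly 3 partitions per available core, per the 3-4x guideline above
target_partitions = 3 * sc.defaultParallelism
rdd = sc.parallelize(range(0, 1000), target_partitions)
print(rdd.getNumPartitions())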

The number of partitions has a high impact on Spark's code performance.

Ideally, the number of partitions reflects how much data you want to shuffle. Normally you should base this parameter on your shuffle size (shuffle read/write), and aim for roughly 128 to 256 MB of data per partition to get maximum performance.
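As a rough illustration of that sizing rule (the shuffle volume and the 200 MB target are made-up numbers you would normally read off the shuffle read/write metrics in the Spark UI):

estimated_shuffle_bytes = 50 * 1024 ** 3   # assume ~50 GB of shuffle data (hypothetical)
target_partition_bytes = 200 * 1024 ** 2   # aim for ~200 MB per partition (within the 128-256 MB range)

num_partitions = max(1, estimated_shuffle_bytes // target_partition_bytes)
print(num_partitions)  # 256 in this example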

You can set the partition count in your Spark SQL code by setting the property:

spark.sql.shuffle.partitions

or, while using any DataFrame, you can set this as below:

df.repartition(numOfPartitions)
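A minimal sketch putting both options together; the value 256 and the session/app name are assumptions for illustration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-partitions-demo")
         .config("spark.sql.shuffle.partitions", 256)  # partitions used after shuffles (joins, aggregations)
         .getOrCreate())

df = spark.range(0, 1000000)

# Or repartition an existing DataFrame explicitly:
df = df.repartition(256)
print(df.rdd.getNumPartitions())  # 256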

