
When and how does Spark distribute partitions on executors

How does Spark assign a partition to an executor?

When I ran the following lines in the Spark shell with 1 driver and 5 executors:

> var data = sc.textFile("file")  // auto-generates 2 partitions
> data.count()                    // materializes the partitions on two nodes
> data = data.repartition(10)     // repartition into 10 partitions
> data.count()                    // 10 partitions still on the original 2 nodes

After repartitioning, the 10 partitions still lie on the original two nodes (out of 5). This seems very inefficient, because 5 tasks run repeatedly on each of the two nodes holding the partitions instead of being spread evenly across all nodes. The inefficiency is most obvious for iterative jobs that run many times over the same RDDs.
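One way to see where partitions actually land (a diagnostic sketch, not part of the original post; it assumes the spark-shell session above, where `data` is the RDD in question and a cluster is running) is to tag each partition with the hostname of the executor that processes it:

```scala
// Tag each partition index with the hostname of the executor that ran it.
// This reflects where the scheduler placed the tasks for *this* job,
// not a fixed, permanent placement of the partitions.
val placement = data
  .mapPartitionsWithIndex { (idx, it) =>
    Iterator((idx, java.net.InetAddress.getLocalHost.getHostName))
  }
  .collect()

placement.foreach { case (idx, host) => println(s"partition $idx -> $host") }
```

If all 10 partitions report only two distinct hostnames, that confirms the skewed placement described above.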

So my question is: how does Spark decide which node holds which partition, and is there a way to force data to be moved to other nodes?

I am just offering a guess here to illustrate the logic (not necessarily what is actually happening).

Let's assume your file is not very large, i.e. it fits in one HDFS block, and that this block is replicated to 2 nodes. If you wanted to process it on a 3rd node, the data would first have to be copied there. Since count is a relatively cheap computation, the time it takes to process each task is small. Spark may have decided it is better to wait and do the processing locally rather than shuffle the data to other nodes (this behavior is configurable).
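The configurable parameter alluded to above is Spark's locality wait: before shipping a task to a less-local executor, the scheduler waits (by default 3 seconds, `spark.locality.wait`) for a slot to free up on a node that already holds the data. Lowering it makes Spark give up on locality sooner and spread tasks across more executors, at the cost of moving data. A configuration sketch (the right value depends on your workload, so treat the numbers as illustrative):

```shell
# Lower the locality wait so tasks fall back to non-local executors sooner.
# Default is 3s; 0 disables locality waiting entirely (more data movement,
# but tasks spread across all executors).
spark-shell \
  --conf spark.locality.wait=1s \
  --num-executors 5
```

There are also per-level variants (`spark.locality.wait.process`, `spark.locality.wait.node`, `spark.locality.wait.rack`) if you want finer control over how quickly the scheduler relaxes each locality level.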

