When and how does Spark distribute partitions on executors
How does Spark assign a partition to an executor?
When I ran the following lines in the spark shell with 1 driver and 5 executors:
> var data = sc.textFile("file") // auto generates 2 partitions
> data.count() // materialize partitions on two nodes
> data = data.repartition(10) // repartition into 10 partitions
> data.count() // 10 partitions still on original 2 nodes
After repartitioning, the 10 partitions still lie on the original two nodes (out of 5). This seems very inefficient, because 5 tasks run repeatedly on each of the nodes holding the partitions instead of being evenly distributed across the cluster. The inefficiency is most obvious for iterative jobs that run many times over the same RDDs.
So my question is: how does Spark decide which node gets which partition, and is there a way to force data to be moved to other nodes?
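One way to check where the partitions actually end up (a spark-shell sketch; `sc` and `data` are from the session above, and the hostname trick is just an illustration) is to tag each partition with the hostname of the executor that processes it:

> data.mapPartitionsWithIndex { (idx, it) =>
>   // runs on the executor owning this partition, so getLocalHost names that node
>   Iterator((idx, java.net.InetAddress.getLocalHost.getHostName, it.size))
> }.collect().foreach(println) // one (partitionIndex, hostName, recordCount) per partition

If the printed hostnames repeat only the original two nodes, the repartitioned data really was kept local rather than spread across all 5 executors.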
I am just providing a guess here to show the logic (not necessarily what is really happening).
Let's assume your file is not really large, i.e. it fits inside one HDFS block, and that this block is replicated to 2 nodes. If you wanted to do the processing on a 3rd node, that node would first have to copy the data over. Since count is a relatively fast computation, the time it takes to process each task is small. Spark may have decided it is better to wait and do the processing locally rather than shuffle the data to other nodes (you can configure this parameter).
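The parameter alluded to above is most likely Spark's locality wait (`spark.locality.wait`, default 3s): how long the scheduler waits for a slot on a data-local node before falling back to a less-local executor. A configuration sketch (the `0s` value is only for illustration; it tells the scheduler not to wait at all):

> // launch the shell with locality wait disabled, so tasks are handed to
> // any free executor instead of queuing behind the two data-local nodes
> // spark-shell --conf spark.locality.wait=0s

Alternatively, an explicit shuffle stage (e.g. the `repartition(10)` above) combined with a low locality wait should let the resulting partitions land on all 5 executors.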