
When and how does Spark distribute partitions to executors?

How does Spark assign a partition to an executor?

When I ran the following lines in the Spark shell with 1 driver and 5 executors:

> var data = sc.textFile("file") // auto generates 2 partitions
> data.count()                   // materialize partitions on two nodes
> data = data.repartition(10)    // repartition into 10 partitions
> data.count()                   // 10 partitions still on original 2 nodes

After repartitioning, the 10 partitions still lie on the original two nodes (out of 5). This seems very inefficient, because 5 tasks run repeatedly on each of the two nodes holding the partitions instead of being spread evenly across the cluster. The inefficiency is most obvious for iterative jobs that run many times over the same RDDs.

So my question is: how does Spark decide which node holds which partition, and is there a way to force the data to be moved to other nodes?

I am just offering a guess here to illustrate the logic (this is not necessarily what actually happens).

Let's assume your file is not very large, i.e., it fits inside one HDFS block, and that this block is replicated to 2 nodes. If you wanted to process it on a 3rd node, the data would first have to be copied there. Since count is a relatively cheap computation, each task finishes quickly, so Spark may decide it is better to wait for a free slot on a data-local node than to ship the data to other nodes. How long it waits before giving up on locality is configurable (the spark.locality.wait setting).
