
Does Spark on YARN consider data locality when launching executors?

I am considering static allocation of Spark executors. Does Spark on YARN consider the data locality of the raw input datasets used in the application when launching executors?

If it does, how does it do so, given that executors are requested and allocated when the SparkContext is initialized? An application may use multiple raw input datasets that physically reside on many different data nodes, and we can't run an executor on every one of those nodes.
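For context, with static allocation the executor count and sizing are fixed at submit time via `spark-submit` flags; nothing in these flags names specific nodes (the cluster values below are illustrative, not a recommendation):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_app.py
```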

I understand that Spark takes data locality into account when scheduling tasks on executors (as mentioned at https://spark.apache.org/docs/latest/tuning.html#data-locality).
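That task-level behavior is tunable through the `spark.locality.wait` family of settings (e.g. in `spark-defaults.conf`): how long the scheduler waits for a free slot at each locality level before falling back to a less local one. The 3s default is from the Spark configuration documentation; the per-level overrides default to the top-level value:

```
spark.locality.wait          3s
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```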

You are correct in saying that

Spark takes care of data locality while scheduling tasks on executors.

When YARN launches an executor, it has no idea where your data is. So, in an ideal case, you would launch executors on all nodes of your cluster; more realistically, you launch them on only a subset of the nodes.

Now, this is not necessarily a bad thing, because HDFS inherently replicates data (three copies of each block by default), which means there is a good chance that a copy of the data already resides on one of the nodes where Spark is running an executor.
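A rough back-of-the-envelope sketch of why replication helps: if we simplify HDFS's rack-aware placement and assume each block's replicas land on distinct nodes chosen uniformly at random, the chance that at least one replica sits on an executor node is already high even when executors cover only half the cluster. The function name and cluster sizes below are illustrative, not part of any Spark API:

```python
from math import comb

def prob_local_replica(total_nodes: int, executor_nodes: int,
                       replication: int = 3) -> float:
    """Probability that at least one replica of a block lands on a node
    that also hosts an executor, assuming replicas go to distinct nodes
    chosen uniformly at random (a simplification of HDFS's actual
    rack-aware placement policy)."""
    # P(no replica on any executor node) = C(M - k, r) / C(M, r)
    p_none = (comb(total_nodes - executor_nodes, replication)
              / comb(total_nodes, replication))
    return 1 - p_none

# e.g. 20 datanodes, executors on 10 of them, default replication 3:
print(round(prob_local_replica(20, 10), 3))  # → 0.895
```

So under this toy model, covering half the datanodes still gives roughly a 90% chance of node-local (or better) access for any given block, and the `spark.locality.wait` mechanism handles the remainder by falling back to rack-local or `ANY` placement.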
