
Spark on YARN resource manager: Relation between YARN Containers and Spark Executors

I'm new to Spark on YARN and don't understand the relation between YARN containers and Spark executors. I tried out the following configuration, based on the results of the yarn-utils.py script, which can be used to find an optimal cluster configuration.

The Hadoop cluster (HDP 2.4) I'm working on:

  • 1 Master Node:
    • CPU: 2 CPUs with 6 cores each = 12 cores
    • RAM: 64 GB
    • SSD: 2 x 512 GB
  • 5 Slave Nodes:
    • CPU: 2 CPUs with 6 cores each = 12 cores
    • RAM: 64 GB
    • HDD: 4 x 3 TB = 12 TB
  • HBase is installed (this is one of the parameters for the script below)

So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True (c=cores, m=memory, d=hdds, k=hbase-installed) and got the following result:

 Using cores=12 memory=64GB disks=4 hbase=True
 Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
 Num Container=8
 Container Ram=6144MB
 Used Ram=48GB
 Unused Ram=16GB
 yarn.scheduler.minimum-allocation-mb=6144
 yarn.scheduler.maximum-allocation-mb=49152
 yarn.nodemanager.resource.memory-mb=49152
 mapreduce.map.memory.mb=6144
 mapreduce.map.java.opts=-Xmx4915m
 mapreduce.reduce.memory.mb=6144
 mapreduce.reduce.java.opts=-Xmx4915m
 yarn.app.mapreduce.am.resource.mb=6144
 yarn.app.mapreduce.am.command-opts=-Xmx4915m
 mapreduce.task.io.sort.mb=2457

I applied these settings via the Ambari interface and restarted the cluster. The values also roughly match what I had calculated manually before.

I now have problems:

  • finding the optimal settings for my spark-submit script
    • the parameters --num-executors , --executor-cores & --executor-memory
  • understanding the relation between a YARN container and a Spark executor
  • understanding the hardware information in my Spark History UI (it shows less memory than I configured, when summed up over all worker nodes)
  • understanding the concept of vcores in YARN; I couldn't find any useful examples for this yet

However, I found the post What is a container in YARN? , but it didn't really help, as it doesn't describe the relation to the executors.

Can someone help solve one or more of these questions?

I will report my insights here step by step:

  • The first important thing is this fact (source: this Cloudera documentation ):

    When running Spark on YARN, each Spark executor runs as a YARN container. [...]

  • This means the number of containers will always equal the number of executors created by a Spark application, e.g. via the --num-executors parameter of spark-submit (plus one additional container for the ApplicationMaster, which in yarn-cluster mode also hosts the driver).

  • Governed by yarn.scheduler.minimum-allocation-mb , every container always allocates at least this amount of memory. This means that if the parameter --executor-memory is set to e.g. only 1g but yarn.scheduler.minimum-allocation-mb is e.g. 6g , the container is much bigger than the Spark application needs.

  • The other way round, if the parameter --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g , the container will allocate more memory dynamically, but only if the requested amount of memory is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.

  • The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in total by all containers on one host!
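The container-sizing points above can be sketched with a bit of shell arithmetic. Two assumptions here that are not stated in the post itself: Spark adds an off-heap overhead on top of --executor-memory (in Spark 1.x roughly max(384 MB, 10% of executor memory), controlled by spark.yarn.executor.memoryOverhead), and the YARN CapacityScheduler rounds the total request up to a multiple of yarn.scheduler.minimum-allocation-mb (the FairScheduler uses yarn.scheduler.increment-allocation-mb instead):

```shell
# Sketch of YARN container sizing for a Spark executor.
# Values taken from the cluster configuration above.
executor_mem_mb=6144      # --executor-memory 6g
min_alloc_mb=6144         # yarn.scheduler.minimum-allocation-mb
node_mem_mb=49152         # yarn.nodemanager.resource.memory-mb

# Assumed overhead rule: max(384 MB, 10% of executor memory).
overhead_mb=$(( executor_mem_mb / 10 ))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi

requested_mb=$(( executor_mem_mb + overhead_mb ))

# YARN rounds the request up to the next multiple of min_alloc_mb.
container_mb=$(( (requested_mb + min_alloc_mb - 1) / min_alloc_mb * min_alloc_mb ))

# How many such containers fit on one NodeManager host.
containers_per_node=$(( node_mem_mb / container_mb ))

echo "request=${requested_mb}MB container=${container_mb}MB per_node=${containers_per_node}"
```

Note the trap this reveals: asking for 6 GB executors with a 6144 MB minimum allocation does not give 6144 MB containers — the ~614 MB overhead pushes the request past one increment, so each container becomes 12288 MB and only 4 of them fit per node instead of 8.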

=> So lowering yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors (otherwise memory would be wasted).

=> Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb ) allows you to define bigger executors (more memory is allocated if requested, e.g. via the --executor-memory parameter).
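Putting this together, one common sizing heuristic (it comes from Cloudera's Spark tuning guides, not from this post): reserve about one core per node for the OS and Hadoop daemons, aim for roughly 5 cores per executor, split the usable node memory among the executors on that node, leave ~10% for the memory overhead, and reserve one executor slot for the ApplicationMaster. For this cluster (5 worker nodes, 12 cores and 48 GB usable each) that could look like the following; the numbers and the application name are illustrative, not the definitive answer:

```shell
# 11 usable cores per node / 5 cores per executor  => 2 executors per node.
# 2 executors * 5 nodes - 1 slot for the AM        => 9 executors.
# 48 GB per node / 2 executors ~= 24 GB; minus ~10% overhead => 21g.
# (Remember the rounding above: with min-allocation-mb=6144, the
#  21g + overhead request is rounded up to 24576 MB per container.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 9 \
  --executor-cores 5 \
  --executor-memory 21g \
  my_spark_app.py
```

Whether 2 large or 8 small executors per node is better depends on the workload (shuffle behavior, GC pressure, HDFS client throughput), so treat this as a starting point to benchmark, not a fixed rule.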
