I'm new to Spark on YARN and don't understand the relation between the YARN Containers
and the Spark Executors
. I tried out the following configuration, based on the results of the yarn-utils.py
script, that can be used to find optimal cluster configuration.
The Hadoop cluster (HDP 2.4) I'm working on:
So I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True
(c=cores, m=memory, d=hdds, k=hbase-installed) and got the following result:
Using cores=12 memory=64GB disks=4 hbase=True
Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
Num Container=8
Container Ram=6144MB
Used Ram=48GB
Unused Ram=16GB
yarn.scheduler.minimum-allocation-mb=6144
yarn.scheduler.maximum-allocation-mb=49152
yarn.nodemanager.resource.memory-mb=49152
mapreduce.map.memory.mb=6144
mapreduce.map.java.opts=-Xmx4915m
mapreduce.reduce.memory.mb=6144
mapreduce.reduce.java.opts=-Xmx4915m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4915m
mapreduce.task.io.sort.mb=2457
These settings I made via the Ambari interface and restarted the cluster. The values also match roughly what I calculated manually before.
I have now problems
spark-submit
script
--num-executors
, --executor-cores
& --executor-memory
. vcores
in YARN, here I couldn't find any useful examples yet However, I found this post What is a container in YARN? , but this didn't really help as it doesn't describe the relation to the executors.
Can someone help to solve one or more of the questions?
I will report my insights here step by step:
First important thing is this fact (Source: this Cloudera documentation ):
When running Spark on YARN, each Spark executor runs as a YARN container. [...]
This means the number of containers will always be the same as the executors created by a Spark application eg via --num-executors
parameter in spark-submit.
Set by the yarn.scheduler.minimum-allocation-mb
every container always allocates at least this amount of memory. This means if parameter --executor-memory
is set to eg only 1g
but yarn.scheduler.minimum-allocation-mb
is eg 6g
, the container is much bigger than needed by the Spark application.
The other way round, if the parameter --executor-memory
is set to somthing higher than the yarn.scheduler.minimum-allocation-mb
value, eg 12g
, the Container will allocate more memory dynamically, but only if the requested amount of memory is smaller or equal to yarn.scheduler.maximum-allocation-mb
value.
The value of yarn.nodemanager.resource.memory-mb
determines, how much memory can be allocated in sum by all containers of one host !
=> So setting yarn.scheduler.minimum-allocation-mb
allows you to run smaller containers eg for smaller executors ( else it would be waste of memory ).
=> Setting yarn.scheduler.maximum-allocation-mb
to the maximum value (eg equal to yarn.nodemanager.resource.memory-mb
) allows you to define bigger executors (more memory is allocated if needed, eg by --executor-memory
parameter).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.