I am trying to understand how Spark runs on YARN in cluster and client mode. I have the following questions.
Is it necessary for Spark to be installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes execute tasks and need to be able to interpret the code (Spark APIs) in the Spark application that the driver sends to the cluster.
It says in the documentation: "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does the client node need Hadoop installed when it only sends the job to the cluster?
Adding to other answers.
- Is it necessary that spark is installed on all the nodes in yarn cluster?
No, not if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on many nodes only in standalone mode.
These are the visualisations of spark app deployment modes.
Spark Standalone Cluster
In cluster mode the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.
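As a rough sketch of how the two modes are selected at submit time: the only difference on the command line is the --deploy-mode flag. The helper name submit_cmd and the jar path are placeholders made up for this illustration, and spark://master:7077 assumes a standalone master on the default port. The function only prints the command instead of running it.

```shell
# Print (do not run) a spark-submit command for a given master URL and deploy mode.
# submit_cmd and the jar path are illustrative placeholders.
submit_cmd() {
  master="$1"   # e.g. spark://master:7077 (standalone) or yarn
  mode="$2"     # client or cluster
  echo "spark-submit --master $master --deploy-mode $mode --class org.apache.spark.examples.SparkPi /path/to/spark-examples.jar 100"
}

# client mode: the driver stays on the machine that launches the job
submit_cmd spark://master:7077 client
# cluster mode: the driver is started inside the cluster
submit_cmd spark://master:7077 cluster
```

Passing yarn as the first argument gives the equivalent YARN commands for Spark 2.0+.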
YARN cluster mode
YARN client mode
This table offers a concise list of differences between these modes:
- It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does client node have to install Hadoop when it is sending the job to cluster?
A Hadoop installation is not mandatory, but (some of) the configuration files are. Such client machines can be called gateway nodes. There are two main reasons:
- The configuration contained in the HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
- In YARN mode, the ResourceManager's address is picked up from the Hadoop configuration (yarn-default.xml). Thus, the --master parameter is yarn.
Spark 2.0+ no longer requires a fat assembly jar for production deployment.
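To make the two reasons concrete, here is a minimal sketch of a submission from a gateway node. The directory /etc/hadoop/conf and the jar path are assumptions for illustration, not values from the documentation. Note that no ResourceManager address appears on the command line; it is read from the files under HADOOP_CONF_DIR. The command is only printed, not executed.

```shell
# Assumed location of the client-side Hadoop configuration files;
# an already exported HADOOP_CONF_DIR takes precedence.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

# The master is just "yarn": the ResourceManager address comes from the
# configuration directory above. The jar path is a placeholder.
SUBMIT_CMD="spark-submit --master yarn --class org.apache.spark.examples.SparkPi /path/to/spark-examples.jar 100"

# Dry run: show what would be executed.
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
echo "$SUBMIT_CMD"
```

To submit for real, run the printed command on a machine where spark-submit is on the PATH.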
We are running Spark jobs on YARN (we use HDP 2.2).
We don't have Spark installed on the cluster. We only added the Spark assembly jar to HDFS.
For example to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
- This setting tells YARN where to take the Spark assembly from. If you don't set it, spark-submit uploads the jar from the machine where you run it.
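For the setting above to work, the assembly jar has to be on HDFS first. A minimal sketch of that step, assuming the jar sits under lib/ in the local Spark distribution (that local path is an assumption; the HDFS target matches the command above). The hdfs commands are only printed here so the snippet is safe to run anywhere.

```shell
# Assumed local path of the assembly jar; adjust to your Spark download.
LOCAL_JAR="lib/spark-assembly-1.3.1-hadoop2.6.0.jar"
# HDFS directory used in the spark-submit example above.
HDFS_DIR="hdfs://master:8020/spark"

# Dry run: print the HDFS commands instead of executing them.
MKDIR_CMD="hdfs dfs -mkdir -p $HDFS_DIR"
PUT_CMD="hdfs dfs -put $LOCAL_JAR $HDFS_DIR/"
echo "$MKDIR_CMD"
echo "$PUT_CMD"

# Value to pass as --conf spark.yarn.jar=...
SPARK_YARN_JAR="$HDFS_DIR/$(basename "$LOCAL_JAR")"
echo "spark.yarn.jar=$SPARK_YARN_JAR"
```

Run the two printed hdfs commands once from any machine with HDFS access; after that every submission can reuse the same jar.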
About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the configuration directory from your cluster to your client.
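After copying the directory, a quick sanity check can save a failed submission. The sketch below assumes the copied directory should contain at least core-site.xml and yarn-site.xml (the usual client-side files; your distribution may need more, such as hdfs-site.xml). The function name check_conf_dir and the default path /etc/hadoop/conf are made up for this example.

```shell
# Check that a directory looks like a usable client-side Hadoop config dir.
# check_conf_dir is an illustrative helper, not part of Hadoop or Spark.
check_conf_dir() {
  dir="$1"
  for f in core-site.xml yarn-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      return 1
    fi
  done
  echo "ok: $dir"
}

# Check the directory HADOOP_CONF_DIR points to (assumed default if unset).
check_conf_dir "${HADOOP_CONF_DIR:-/etc/hadoop/conf}" \
  || echo "copy the configuration directory from the cluster first"
```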
1 - Spark follows a master/slave architecture, so on your cluster you have to install one Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture gives you some benefits. There is a very good explanation of it here: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before, you can run it in standalone mode.
Let me try to cut the clutter and keep it short for the impatient.
There are 6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 kinds of resource (cluster) management.
Here's the relation:
Client: nothing special, it is the one submitting the Spark app.
Workers and executors: nothing special, one worker holds one or more executors.
(no matter client or cluster mode)
Voilà!