
Spark with Hadoop YARN: use all cluster nodes

I'm using Spark with HDFS for Hadoop storage and YARN. My cluster contains 5 nodes (1 master and 4 slaves).

  • Master node: 48 GB RAM, 16 CPU cores
  • Slave nodes: 12 GB RAM, 16 CPU cores

I'm running two different jobs: a WordCount and a SparkSQL query, on two different files. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.

First example: WordCount

I ran the WordCount job and got the result in two files (part-00000 and part-00001). The blocks are available on slave4 and slave1 for part-00000, and on slave3 and slave4 for part-00001.

Why is there no part on slave2? Is that normal?

When I look at the application (by its application ID), I see that only one slave did the work:

[screenshot of the application view showing only one slave doing the work]

Why isn't my job distributed evenly across my cluster?

Second example: SparkSQL

In this case I don't save a file, because I only want to return an SQL result, but again only one slave node does the work.

So why does only one slave node run the task while the whole cluster seems to be working fine?

The command line to execute this is:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py

Thank you!

spark.executor.instances defaults to 2

You need to increase this value to have more executors running at once.
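For example (a sketch only; the value 4 simply matches the 4 slaves in this cluster, it is not a tuned setting), the executor count can be raised directly on the spark-submit line from the question. On YARN, --num-executors is the command-line equivalent of setting spark.executor.instances:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 4 /home/valentin/SparkCount.py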

You can also tweak the cores and memory allocated to each executor. As far as I know, there is no magic formula.
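As a rough illustration only (assuming YARN on each slave actually has most of its 12 GB and 16 cores to give out; the real limits come from the NodeManager resource settings and the per-executor memory overhead), the sizing flags look like this:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 4 --executor-memory 6G /home/valentin/SparkCount.py

This would ask for one 6 GB, 4-core executor per slave; whether YARN can actually grant that depends on how the NodeManagers are configured.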

If you don't want to specify these values by hand, I'd suggest reading the section on Speculative Execution in the Spark documentation.
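If you do look into that, speculative execution is enabled with a single property (shown here as a --conf flag; note that it re-launches slow tasks on other executors rather than changing how many executors you get):

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --conf spark.speculation=true /home/valentin/SparkCount.py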
