I'm using Spark with HDFS for storage and YARN as the resource manager. My cluster contains 5 nodes (1 master and 4 slaves).
I'm running two different jobs, a WordCount and a SparkSQL query, on two different files. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.
First example: WordCount
I ran the WordCount job and got the result in two files (part-00000 and part-00001). HDFS reports part-00000 as available on slave4 and slave1, and part-00001 on slave3 and slave4.
Why is there no part on slave2? Is that normal?
When I look at the application_ID in the YARN UI, I see that only 1 slave did the work:
Why is my task not distributed across the cluster?
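(For reference, my SparkCount.py follows the standard WordCount pipeline, flatMap → map → reduceByKey; since the script itself isn't shown, here is the same logic sketched in plain Python:)

```python
def flat_map(f, xs):
    """Apply f to each element and flatten the results (Spark's flatMap)."""
    return [y for x in xs for y in f(x)]

def word_count(lines):
    words = flat_map(str.split, lines)   # flatMap: split lines into words
    pairs = [(w, 1) for w in words]      # map: (word, 1) pairs
    counts = {}
    for w, n in pairs:                   # reduceByKey: sum counts per word
        counts[w] = counts.get(w, 0) + n
    return counts

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```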
Second example: SparkSQL
In this case there is no saved file because I only want the SQL result returned, but again only 1 slave node does the work.
So why does only 1 slave node execute the task, when the cluster otherwise seems to be working fine?
The command line I use to launch the job is:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
Thank you !
spark.executor.instances defaults to 2. You need to increase this value to have more executors running at once.
You can also tweak the number of cores and the amount of memory allocated to each executor. As far as I know, there is no magic formula.
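For example, your submit command could request more executors explicitly (the sizing values below are hypothetical, chosen for a 4-slave cluster; tune them to your actual node resources):

```shell
# --num-executors is the CLI equivalent of --conf spark.executor.instances=4
time ./spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  /home/valentin/SparkCount.py
```

With 4 executors, YARN can place one on each slave, provided each node has enough free cores and memory for the requested container size.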
If you would rather not specify these values by hand, I suggest reading the section on Dynamic Allocation in the Spark documentation.
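With dynamic allocation, Spark grows and shrinks the executor count based on the pending workload instead of using a fixed number. A sketch of the relevant settings (note this also requires the external shuffle service to be enabled on the YARN node managers; the min/max bounds below are illustrative):

```shell
time ./spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=4 \
  /home/valentin/SparkCount.py
```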