
Spark with Hadoop YARN: use all cluster nodes

I'm using Spark with HDFS for Hadoop storage and YARN. My cluster contains 5 nodes (1 master and 4 slaves).

  • Master node: 48 GB RAM, 16 CPU cores
  • Slave nodes: 12 GB RAM, 16 CPU cores

I'm running two different jobs: a WordCount and a SparkSQL query, on two different files. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.

First example: WordCount

I ran the WordCount job and got the result in two files (part-00000 and part-00001). The blocks are available on slave4 and slave1 for part-00000, and on slave3 and slave4 for part-00001.

Why is there no part on slave2? Is that normal?

When I look at the application (by its application ID), I see that only one slave did the work:

[screenshot of the application view showing only one slave doing the work]

Why isn't my job distributed evenly across my cluster?

Second example: SparkSQL

In this case I don't save a file, because I only want to return an SQL result, but again only one slave node does the work.

So why does only one slave node run the task while the whole cluster seems to be working fine?

The command line to execute this is:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py

Thank you!

spark.executor.instances defaults to 2

You need to increase this value to have more executors running at once.
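For example (a sketch only; the value 4 simply matches the 4 slaves in this cluster, it is not a tuned setting), the executor count can be raised directly on the spark-submit line from the question. On YARN, --num-executors is the command-line equivalent of setting spark.executor.instances:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 4 /home/valentin/SparkCount.py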

You can also tweak the cores and memory allocated to each executor. As far as I know, there is no magic formula.
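As a rough illustration only (assuming YARN on each slave actually has most of its 12 GB and 16 cores to give out; the real limits come from the NodeManager resource settings and the per-executor memory overhead), the sizing flags look like this:

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 4 --executor-memory 6G /home/valentin/SparkCount.py

This would ask for one 6 GB, 4-core executor per slave; whether YARN can actually grant that depends on how the NodeManagers are configured.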

If you don't want to specify these values by hand, I'd suggest reading the section on Speculative Execution in the Spark documentation.
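If you do look into that, speculative execution is enabled with a single property (shown here as a --conf flag; note that it re-launches slow tasks on other executors rather than changing how many executors you get):

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster --conf spark.speculation=true /home/valentin/SparkCount.py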
