
Spark standalone mode not distributing job to other worker node

I am running a Spark job in standalone mode. I have configured my worker node to connect to the master node. They connect successfully, but when I run the job on the Spark master, the job is not distributed. I keep getting the following message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have tried running the job locally on the worker node and it runs fine, which means resources are available. The Spark master UI also shows that the worker has accepted the job. Passwordless SSH is enabled between the master and worker nodes in both directions. I suspect it might be a firewall issue, or perhaps the Spark driver port is not open. My worker node logs show:

16/03/21 10:05:40 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-oracle/bin/java" "-cp" "/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/sbin/../conf/:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar" "-Xms8192M" "-Xmx8192M" "-Dspark.driver.port=51810" "-Dspark.cassandra.connection.port=9042" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@10.0.1.192:51810/user/CoarseGrainedScheduler" "--executor-id" "2" "--hostname" "10.0.1.194" "--cores" "4" "--app-id" "app-20160321100135-0001" "--worker-url" "akka.tcp://sparkWorker@10.0.1.194:39423/user/Worker"

The executor on the worker node shows the following log in stderr:

16/03/21 10:13:52 INFO Slf4jLogger: Slf4jLogger started
16/03/21 10:13:52 INFO Remoting: Starting remoting
16/03/21 10:13:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.0.1.194:59715]
16/03/21 10:13:52 INFO Utils: Successfully started service 'driverPropsFetcher' on port 59715.

You can specify a fixed driver port (spark.driver.port) within the Spark context:

val conf = new SparkConf().set("spark.driver.port", "51810")

PS: When you manually start the Spark worker on the worker machine and connect it to the master, you don't need any passwordless authentication or similar between master and workers. That would only be necessary if you use the master to start all slaves (start-slaves.sh). So this shouldn't be a problem.

Many people hit this issue when setting up a new cluster. If you can see the Spark slaves in the web UI but they are not accepting jobs, there is a high chance that the firewall is blocking the communication. Take a look at my other answer: Apache Spark on Mesos: Initial job has not accepted any resources:

While most of the other answers focus on resource allocation (cores, memory) on the Spark slaves, I would like to highlight that a firewall can cause exactly the same issue, especially when you are running Spark on cloud platforms.

If you can see the Spark slaves in the web UI, you have probably opened the standard ports 8080, 8081, 7077, and 4040. Nonetheless, when you actually run a job, it uses SPARK_WORKER_PORT, spark.driver.port and spark.blockManager.port, which by default are randomly assigned. If your firewall blocks these ports, the master cannot retrieve any job-specific response from the slaves and returns the error.
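To keep the firewall rules manageable, the randomly assigned ports can be pinned to fixed values in spark-defaults.conf and those fixed ports opened in the firewall. A minimal sketch (the port numbers below are placeholders, pick any ports that are free in your environment); spark.port.maxRetries controls how many consecutive ports Spark will try if the chosen one is already taken:

```
# conf/spark-defaults.conf -- example values, choose free ports in your environment
spark.driver.port          51810
spark.blockManager.port    51811
spark.port.maxRetries      16
```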

You can run a quick test by opening all the ports and seeing whether the slaves accept jobs.
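Another way to narrow it down without opening everything is to check, from a worker machine, whether the driver's port is actually reachable through the firewall. A minimal sketch in Python (the host and port values in the usage line are placeholders taken from the logs above):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a plain TCP connection; True means something is listening
    on host:port and the firewall lets the connection through."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholder values from the executor log above: the driver host
    # and spark.driver.port that the executor is told to call back to.
    print(is_port_open("10.0.1.192", 51810))
```

If this returns False from the worker while the driver is running, the driver port (or a firewall rule in between) is the problem rather than resource allocation.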
