
Spark standalone mode not distributing job to other worker node

I am running a Spark job in standalone mode. I have configured my worker node to connect to the master node. They connect successfully, but when I run the job on the Spark master, the job does not get distributed. I keep getting the following message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have tried running the job locally on the worker node and it runs fine, which means resources are available. The Spark master UI also shows that the worker has accepted the job. Passwordless ssh is enabled in both directions between the master and worker nodes. I suspect it might be a firewall issue, or that the Spark driver port is not open. My worker node logs show:

16/03/21 10:05:40 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-oracle/bin/java" "-cp" "/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/sbin/../conf/:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/mnt/pd1/spark/spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar" "-Xms8192M" "-Xmx8192M" "-Dspark.driver.port=51810" "-Dspark.cassandra.connection.port=9042" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@10.0.1.192:51810/user/CoarseGrainedScheduler" "--executor-id" "2" "--hostname" "10.0.1.194" "--cores" "4" "--app-id" "app-20160321100135-0001" "--worker-url" "akka.tcp://sparkWorker@10.0.1.194:39423/user/Worker"

The executor on the worker node shows the following log in stderr:

16/03/21 10:13:52 INFO Slf4jLogger: Slf4jLogger started
16/03/21 10:13:52 INFO Remoting: Starting remoting
16/03/21 10:13:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.0.1.194:59715]
16/03/21 10:13:52 INFO Utils: Successfully started service 'driverPropsFetcher' on port 59715.

You can specify a specific driver port within the Spark context:

spark.driver.port  = "port"
val conf = new SparkConf().set("spark.driver.port", "51810") 
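
For completeness, here is a minimal, self-contained sketch of the same idea (assumptions: Spark's Scala API as used in the line above, a placeholder application name, and the master/driver addresses and port taken from the launch command in the question):

import org.apache.spark.{SparkConf, SparkContext}

// Pin the driver port so a single, known port can be opened in the firewall
// instead of the randomly assigned default.
val conf = new SparkConf()
  .setAppName("driver-port-example")       // hypothetical application name
  .setMaster("spark://10.0.1.192:7077")    // assumed master URL for this cluster
  .set("spark.driver.port", "51810")       // fixed driver port, as in the launch command above

val sc = new SparkContext(conf)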

PS: When manually starting the Spark worker on the worker machine and connecting it to the master, you don't need any further passwordless authentication or similar between the master and the worker. That would only be necessary if you used the master to start all slaves (start-slaves.sh). So this shouldn't be a problem.

Many people have this issue when setting up a new cluster. If you can find the Spark slaves in the web UI but they are not accepting jobs, there is a high chance that a firewall is blocking the communication. Take a look at my other answer: Apache Spark on Mesos: Initial job has not accepted any resources.

While most of the other answers focus on resource allocation (cores, memory) on the Spark slaves, I would like to highlight that a firewall can cause exactly the same issue, especially when you are running Spark on a cloud platform.

If you can find the Spark slaves in the web UI, you have probably opened the standard ports 8080, 8081, 7077 and 4040. Nonetheless, when you actually run a job, it uses SPARK_WORKER_PORT, spark.driver.port and spark.blockManager.port, which by default are randomly assigned. If your firewall is blocking these ports, the master cannot retrieve any job-specific response from the slaves and returns the error.
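
To make that concrete, here is a hedged sketch of pinning those ports to fixed values so that matching firewall rules can be created (the port numbers and application name are placeholders, not a recommendation; SPARK_WORKER_PORT cannot be set from the application and is typically exported in conf/spark-env.sh on each worker):

import org.apache.spark.{SparkConf, SparkContext}

// Fix the otherwise randomly assigned ports so the firewall only needs to
// allow a small, known set of ports between the driver, master and workers.
val conf = new SparkConf()
  .setAppName("fixed-ports-example")          // hypothetical application name
  .setMaster("spark://10.0.1.192:7077")       // assumed master URL for this cluster
  .set("spark.driver.port", "51810")          // executors connect back to the driver on this port
  .set("spark.blockManager.port", "51811")    // block transfers between nodes use this port

val sc = new SparkContext(conf)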

You can run a quick test by opening all the ports and checking whether the slaves accept jobs.
