
Spark atop of Docker not accepting jobs

I'm trying to get a hello world example working with Spark + Docker; here is my code.

import org.apache.spark.SparkContext

object Generic {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")

    val NUM_SAMPLES = 100000
    val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x * x + y * y < 1) 1.0 else 0.0
    }.reduce(_ + _)

    println("Pi is roughly " + 4 * count / NUM_SAMPLES)
  }
}

When I run `sbt run`, I get

14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

I checked both the cluster UI, where I have 3 nodes that each have 1.5 GB of memory, and the namenode UI, where I see the same thing.

The docker logs show no output from the workers, and the following from the master:

14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark@10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085

] ]

This happens a couple of times, and then the program times out and dies with

[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down

When I did a tcpdump on the docker0 interface, it looked like the workers and the master node were talking.

However, the Spark console works.

If I set sc to `val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME"))`, the program runs.

I've been there. The issue looks like the Akka actor subsystem in Spark is binding to a different interface than docker0.

While your master IP is on: spark://172.17.0.3:7077

Akka is binding on: akka.tcp://spark@10.0.3.1:48085

If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.

Try providing the master and slaves with their correct local IPs using the env config SPARK_LOCAL_IP. See the config docs for details.
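As a sketch, SPARK_LOCAL_IP can be exported before launching each daemon so both Spark and Akka bind to that container's docker0 address (the addresses below are illustrative, not taken from your cluster):

```shell
# On the master container (assuming its docker0 address is 172.17.0.3)
export SPARK_LOCAL_IP=172.17.0.3
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.master.Master

# On each worker container, export that container's own address before starting it
export SPARK_LOCAL_IP=172.17.0.4
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.3:7077
```

The same variable should also be set in the environment of the driver program, so the Akka endpoint it advertises is reachable from the containers.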

In our Docker setup for Spark 0.9 we are using this command to start the slaves:

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP 

which directly provides the local IP to the worker.

For running Spark on Docker it's crucial to

  1. Expose all necessary ports
  2. Set the correct spark.broadcast.factory
  3. Handle Docker aliases

Without handling all 3 issues, the Spark cluster parts (master, worker, driver) can't communicate. You can read about every issue in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a container ready for Spark from https://registry.hub.docker.com/u/epahomov/docker-spark/
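A minimal sketch of points 1 and 3 with plain `docker run` (the image name, paths, and port list are assumptions; adjust them to your own image and to any extra ports you fix in your Spark config):

```shell
# Start the master with a stable hostname and its key ports published
# (7077 = cluster port, 8080 = web UI)
docker run -d --name spark-master -h spark-master \
  -p 7077:7077 -p 8080:8080 \
  my/spark-image \
  /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

# Start a worker that resolves the master by alias instead of a hard-coded IP
docker run -d --link spark-master:spark-master \
  my/spark-image \
  /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
```

Resolving the master through a link alias keeps the worker configuration stable even when the master container is recreated with a different 172.17.x.x address.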

If you are on a Windows host, you have to check the firewall and make sure java.exe is allowed to access the public network, or change dockerNAT to private. In general, the worker must be able to connect back to the driver (the program you submitted).
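One way to make that connect-back explicit (a sketch for the Spark 0.9-era system-property configuration; the address and port below are assumptions for your host) is to pin the driver's advertised host and port before creating the context, so you know exactly which endpoint to allow through the firewall:

```scala
import org.apache.spark.SparkContext

object Generic {
  def main(args: Array[String]) {
    // Address the workers should use to reach the driver; must be routable
    // from inside the containers (e.g. the docker0 gateway on the host).
    System.setProperty("spark.driver.host", "172.17.42.1") // assumed docker0 gateway
    // Fix the driver port instead of a random one, so a single firewall rule suffices.
    System.setProperty("spark.driver.port", "51000")       // assumed free port

    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")
    // ... job code as above ...
  }
}
```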
