
Spark 0.9.0: worker keeps dying in standalone mode when job fails

I am new to Spark. I am running Spark in standalone mode on my Mac. I bring up the master and the worker, and they both come up fine. The master's log file looks like:

...
14/02/25 18:52:43 INFO Slf4jLogger: Slf4jLogger started
14/02/25 18:52:43 INFO Remoting: Starting remoting
14/02/25 18:52:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077]
14/02/25 18:52:43 INFO Master: Starting Spark master at spark://Shirishs-MacBook-Pro.local:7077
14/02/25 18:52:43 INFO MasterWebUI: Started Master web UI at http://192.168.1.106:8080
14/02/25 18:52:43 INFO Master: I have been elected leader! New state: ALIVE
14/02/25 18:53:03 INFO Master: Registering worker Shirishs-MacBook-Pro.local:53956 with 4 cores, 15.0 GB RAM

The worker log looks like:

14/02/25 18:53:02 INFO Slf4jLogger: Slf4jLogger started
14/02/25 18:53:02 INFO Remoting: Starting remoting
14/02/25 18:53:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@192.168.1.106:53956]
14/02/25 18:53:02 INFO Worker: Starting Spark worker 192.168.1.106:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:53:02 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
14/02/25 18:53:02 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
14/02/25 18:53:02 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
14/02/25 18:53:03 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077

Now, when I submit a job, the job fails to execute (because of a class-not-found error), but the worker also dies. Here is the master log:

14/02/25 18:55:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
14/02/25 18:55:52 INFO Master: Launching driver driver-20140225185552-0000 on worker worker-20140225185302-192.168.1.106-53956
14/02/25 18:55:55 INFO Master: Registering worker Shirishs-MacBook-Pro.local:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:55:55 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkWorker@192.168.1.106:53956
14/02/25 18:55:55 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.1.106%3A53962-2#-21389169] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] -> [akka.tcp://driverClient@192.168.1.106:53961]: Error [Association failed with [akka.tcp://driverClient@192.168.1.106:53961]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@192.168.1.106:53961]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:53961
] 
...
...
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:56:03 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:10 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:18 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:25 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:33 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:40 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
...

The worker log looks like this:

14/02/25 18:55:52 INFO Worker: Asked to launch driver driver-20140225185552-0000
2014-02-25 18:55:52.534 java[11415:330b] Unable to load realm info from SCDynamicStore
14/02/25 18:55:52 INFO DriverRunner: Copying user jar file:/Users/shirish_kumar/Developer/spark_app/SimpleApp to /Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140225185552-0000/SimpleApp
14/02/25 18:55:53 INFO DriverRunner: Launch Command: "/Library/Java/JavaVirtualMachines/jdk1.7.0_40.jdk/Contents/Home/bin/java" "-cp" ":/Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140225185552-0000/SimpleApp:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/conf:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.0.4.jar" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@192.168.1.106:53956/user/Worker" "SimpleApp"
14/02/25 18:55:55 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/02/25 18:55:55 INFO Worker: Starting Spark worker 192.168.1.106:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:55:55 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
14/02/25 18:55:55 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
14/02/25 18:55:55 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
14/02/25 18:55:55 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077

After this, the worker is shown as dead in the web UI.

My question is: has anyone encountered this problem? The worker should not die just because a job fails.

Check your /Spark/work folder. There you can see the exact error for that particular driver.

For me it was a class-not-found exception. Just give the fully qualified class name for the application's main class (include the package name too).
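For illustration, here is what a correctly packaged main class might look like (the package name com.example is a hypothetical placeholder, not from the original post). With this layout, the class must be submitted as com.example.SimpleApp, not just SimpleApp:

// A minimal sketch, assuming a Spark 0.9-era Scala application.
// The package name com.example is hypothetical.
package com.example

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("spark://Shirishs-MacBook-Pro.local:7077")
    val sc = new SparkContext(conf)
    // A trivial job, just to prove the driver starts correctly.
    println("count = " + sc.parallelize(1 to 100).count())
    sc.stop()
  }
}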

Then clear out the work directory and launch your application in standalone mode again. This should work!

You have to specify the path to your JAR files.

Programmatically, you can do it this way:

sparkConf.set("spark.jars", "file:/myjar1, file:/myjarN")

This implies you first have to compile your application into a JAR file.

You also have to link dependent JARs; there are multiple ways to automate this, but they are well beyond the scope of this question.
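As a rough sketch (the jar paths below are placeholders, not from the original post), the driver-side configuration could look like this:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: ship the application jar and a dependent jar
// to the cluster via spark.jars. Paths are hypothetical placeholders.
val sparkConf = new SparkConf()
  .setAppName("SimpleApp")
  .setMaster("spark://Shirishs-MacBook-Pro.local:7077")
  .set("spark.jars", "file:/path/to/myapp.jar,file:/path/to/dep.jar")

val sc = new SparkContext(sparkConf)

A build tool such as sbt (with the sbt-assembly plugin) or Maven (with the shade plugin) can bundle the dependent JARs into a single fat JAR, which avoids listing each one by hand.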

