
Lost Executor trying to load Graph using Spark/GraphX in Yarn/hdfs Cluster

I am trying to run a Spark/GraphX program written in Scala on a YARN cluster with HDFS. The cluster has 16 nodes, each with 16 GB of RAM and a 2 TB hard disk. All I want is to load a 3.29 GB undirected graph (called orkutUndirected.txt) using the edgeListFile function provided by the GraphX library:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

object MyApp {

  def main(args: Array[String]): Unit = {

    // Create the Spark configuration and context
    val conf = new SparkConf().setAppName("My App")
    val sc = new SparkContext(conf)
    val edgeFile = "hdfs://master-bigdata:8020/user/sparklab/orkutUndirected.txt"

    // Load the edges as a graph
    val graph = GraphLoader.edgeListFile(sc, edgeFile, false, 1,
      StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
  }
}

I start the run with the following spark-submit command:

nohup spark-submit --master yarn --executor-memory 7g --num-executors 4 --executor-cores 2 ./target/scala-2.10/myapp_2.10-1.0.jar &

I tried different values of --executor-memory, but no luck. After a few minutes I see the following inside nohup.out:

16/02/24 23:45:25 ERROR YarnScheduler: Lost executor 1 on node12-bigdata:     Executor heartbeat timed out after 160351 ms
16/02/24 23:45:29 ERROR YarnScheduler: Lost executor 1 on node12-bigdata:     remote Rpc client disassociated
16/02/25 00:04:08 ERROR YarnScheduler: Lost executor 3 on node13-bigdata:     remote Rpc client disassociated
16/02/25 00:18:05 ERROR YarnScheduler: Lost executor 4 on node06-bigdata:     Executor heartbeat timed out after 129723 ms
16/02/25 00:18:07 ERROR YarnScheduler: Lost executor 4 on node06-bigdata:     remote Rpc client disassociated
16/02/25 00:21:52 ERROR YarnScheduler: Lost executor 4 on node16-bigdata:     remote Rpc client disassociated
16/02/25 00:41:29 ERROR YarnScheduler: Lost executor 1 on node03-bigdata:     remote Rpc client disassociated
16/02/25 00:44:52 ERROR YarnScheduler: Lost executor 5 on node16-bigdata:     remote Rpc client disassociated
16/02/25 00:44:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 4 times, most     recent failure: Lost task 0.3 in stage 0.0 (TID 3, node16-bigdata):
ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
at     org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)

... ... ......

Do you have any idea what could be wrong?

Depending on what types of objects you are creating from the raw text file, 3.29 GB of raw data can easily exceed the working memory of this cluster, causing long GC pauses and lost executors. In addition to the overhead of wrapping data in Java objects, GraphX adds its own overhead on top of RDDs. VisualVM and Ganglia are good tools for debugging these memory-related issues. Also, see the Tuning Spark guide for tips on how to keep your graph lean.
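Heartbeat timeouts during long GC pauses can often be mitigated by giving each executor some off-heap headroom and a longer network timeout. A minimal, untested sketch (the property names are standard Spark 1.x settings; the values of 1024 MB and 300s are guesses to tune for your cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise off-heap headroom and the network timeout so executors
// survive GC pauses instead of being declared lost.
// Note: on YARN, spark.yarn.executor.memoryOverhead may need to be passed to
// spark-submit via --conf instead, since it is read when containers are requested.
val conf = new SparkConf()
  .setAppName("My App")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB per executor (guess)
  .set("spark.network.timeout", "300s")              // default is 120s
val sc = new SparkContext(conf)
```

The same settings can equivalently be supplied on the spark-submit command line with `--conf key=value`, which avoids recompiling.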

Another possibility is that the data was not partitioned optimally, causing some tasks to stall. Check the stage information in the Spark UI and make sure each task is working on evenly distributed data; if it is not, repartition your data. I found the Cloudera blog on this subject useful.
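Note that in the posted code the fourth argument to edgeListFile (numEdgePartitions) is 1, so the entire 3.29 GB file is parsed into a single partition, concentrating the whole load on one task. A sketch of spreading it out (the partition count of 64 is a guess; a common rule of thumb is 2-4x the total executor cores):

```scala
// Split the edge list across many partitions so parsing and storage are
// spread over the executors instead of landing on a single task.
val graph = GraphLoader.edgeListFile(sc, edgeFile,
  canonicalOrientation = false,
  numEdgePartitions = 64, // was 1; guess, tune to your cluster
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

// Check the distribution in the Spark UI, or inspect per-partition sizes directly:
graph.edges.mapPartitions(it => Iterator(it.size)).collect().foreach(println)
```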
