
PySpark job failed in Google Dataproc

My job failed with the logs below, but I don't fully understand them. The failure seems to be caused by:

"YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 24.7 GB of 24 GB physical"

How can I increase the memory in Google Dataproc?

Logs:

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 332.0 in stage 0.0 (TID 332, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 335.0 in stage 0.0 (TID 335, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 329.0 in stage 0.0 (TID 329, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Traceback (most recent call last):
  File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 127, in <module>
    main()
  File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 121, in main
    d.saveAsTextFile('gs://ll_hang/decahose-hashtags/data-multi3')
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1506, in saveAsTextFile
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 191 in stage 0.0 failed 4 times, most recent failure: Lost task 191.3 in stage 0.0 (TID 483, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1213)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:951)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1457)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1436)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:507)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 280.1 in stage 0.0 (TID 475, cluster-4-w-3.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 283.1 in stage 0.0 (TID 474, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 184.1 in stage 0.0 (TID 463, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 81.0 in stage 0.0 (TID 81, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 85.0 in stage 0.0 (TID 85, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 84.0 in stage 0.0 (TID 84, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@27cb5c01,null)
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 438.1 in stage 0.0 (TID 442, cluster-4-w-23.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@71f24e3e,null)
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 114 idle
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 97.0 in stage 0.0 (TID 97, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 102.0 in stage 0.0 (TID 102, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@2ed7b1d,null)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@1b339b4f,null)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 190.1 in stage 0.0 (TID 461, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 111.0 in stage 0.0 (TID 111, cluster-4-w-74.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 101.0 in stage 0.0 (TID 101, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:578)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
16/05/05 01:12:42 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

In Dataproc, Spark is configured to pack one executor per half machine, and each executor then runs multiple tasks in parallel depending on how many cores are available in that half machine. For example, on an n1-standard-4 you'd expect each executor to use 2 cores and thus run two tasks in parallel at a time. Memory is carved up the same way, although some of it is also reserved for daemon services, some goes to YARN executor overhead, and so on.
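If you want to confirm how Dataproc carved things up on your particular cluster, one quick way (a minimal sketch; it simply reads back whatever Spark was started with, and a property that was never set explicitly will fall through to the "unset" placeholder) is to print the relevant properties from inside a running job:

    from pyspark import SparkContext

    sc = SparkContext()
    conf = sc.getConf()

    # Dataproc picks these defaults at cluster creation time based on the
    # machine type; "unset" just means Spark is using its own built-in default.
    for key in ("spark.executor.cores",
                "spark.executor.memory",
                "spark.yarn.executor.memoryOverhead"):
        print("%s = %s" % (key, conf.get(key, "unset")))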

This means in general, you have a few options for increasing the per-task memory:

  1. You can reduce spark.executor.cores by 1 at a time, down to a minimum of 1; since this leaves spark.executor.memory unchanged, in effect each parallel task now gets to share a larger portion of the per-executor memory. For example, on n1-standard-8, the default setting would be spark.executor.cores=4 and executor memory would be something like 12GB or so, so each "task" would be able to use ~3GB of memory. If you set spark.executor.cores=3, this leaves the executor memory at 12GB, and now each task gets ~4GB. You can at least try reducing it all the way down to spark.executor.cores=1 to see whether this approach is feasible at all; then increase it again, as long as the job still succeeds, to keep CPU utilization up. You can do this at job-submission time:

     gcloud dataproc jobs submit pyspark --properties spark.executor.cores=1 ... 
  2. Alternatively, you can crank up spark.executor.memory; take a look at your cluster's resources with gcloud dataproc clusters describe cluster-4 and you should see the current setting. (A sketch of setting these properties from inside the script, rather than on the command line, follows this list.)

  3. If you don't want to waste cores, you may want to try a different machine type. For example, if you are currently using n1-standard-8, try n1-highmem-8 instead. Dataproc still gives half a machine to each executor, so you'd end up with more memory per executor. You can also use custom machine types to fine-tune the balance of memory to CPU.
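If you'd rather keep the configuration in the PySpark script than pass --properties to gcloud, the same settings can go on a SparkConf before the SparkContext is created. This is only a sketch: the values shown (2 cores, 12g of heap, 2048 MB of overhead) are placeholders you would tune to your machine type, not recommendations.

    from pyspark import SparkConf, SparkContext

    # Executor container sizes are requested from YARN when executors are
    # allocated, so these must be set before the SparkContext is constructed.
    conf = (SparkConf()
            .set("spark.executor.cores", "2")                     # fewer cores -> more memory per task
            .set("spark.executor.memory", "12g")                  # heap per executor
            .set("spark.yarn.executor.memoryOverhead", "2048"))   # off-heap headroom, in MB

    sc = SparkContext(conf=conf)

Equivalently, all three properties can be passed at submission time with gcloud dataproc jobs submit pyspark --properties, as in the example under option 1.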
