
Spark OutOfMemoryError when adding executors

I am trying to run the L-BFGS example from MLlib (https://spark.apache.org/docs/1.0.0/mllib-optimization.html#limited-memory-bfgs-l-bfgs) on a large dataset (~100 GB) with DISK_ONLY persistence, using 16 GB for the driver and 16 GB per executor.

Everything runs smoothly when I use a few executors (10), but I get an OutOfMemoryError: Java heap space on the driver when I try to use more (40). I think it might be related to the level of parallelism (as described in https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism).

I tried setting spark.default.parallelism to large values (from 5000 to 15000), but the problem persists, and the setting does not seem to be taken into account (there are only around 500 tasks per job), even though it shows up in the Environment tab.

I am using Spark 1.0.0 with Java on a YARN cluster. I set the default parallelism with SparkConf conf = new SparkConf().set("spark.default.parallelism", "15000");.
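As an aside, besides setting the property programmatically, it can also be placed in conf/spark-defaults.conf, which spark-submit loads by default (the values below are simply the ones from my job, not recommendations):

```
# conf/spark-defaults.conf
spark.default.parallelism   15000
spark.driver.memory         16g
spark.executor.memory       16g
```

Note that spark.default.parallelism mainly affects shuffle operations that are called without an explicit partition count; RDDs read from HDFS still get one partition per input block, which could explain why only ~500 tasks appear per job.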

Stack trace:

14/10/20 11:25:16 INFO TaskSetManager: Starting task 30.0:20 as TID 60630 on executor 17: a4-5d-36-fc-ef-54.hpc.criteo.preprod (PROCESS_LOCAL)
14/10/20 11:25:16 INFO TaskSetManager: Serialized task 30.0:20 as 127544326 bytes in 227 ms
14/10/20 11:25:59 INFO TaskSetManager: Starting task 30.0:68 as TID 60631 on executor 10: a4-5d-36-fc-9f-2c.hpc.criteo.preprod (PROCESS_LOCAL)
14/10/20 11:25:59 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-5] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
    at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:458)
    at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:49)
    at sun.reflect.GeneratedMethodAccessor98.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$launchTasks$1.apply(CoarseGrainedSchedulerBackend.scala:145)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$launchTasks$1.apply(CoarseGrainedSchedulerBackend.scala:143)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.launchTasks(CoarseGrainedSchedulerBackend.scala:143)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:131)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
14/10/20 11:25:59 INFO DAGScheduler: Failed to run aggregate at LBFGS.scala:201
14/10/20 11:25:59 INFO ApplicationMaster: finishApplicationMaster with FAILED
14/10/20 11:25:59 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
Exception in thread "Thread-4" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:187)
Caused by: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:638)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:638)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1215)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
    at akka.actor.ActorCell.terminate(ActorCell.scala:338)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
    at akka.dispatch.Mailbox.run(Mailbox.scala:218)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any idea why this error happens and how I can solve it?
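For context, the log above shows each task being serialized to about 127 MB ("Serialized task 30.0:20 as 127544326 bytes"), which suggests the task closure itself is very large. A rough, illustrative calculation (the task size comes from the log; the 2x growth factor and everything else are my assumptions, not measurements):

```java
// Back-of-envelope check (not Spark code): how many ~127 MB task closures
// a 16 GB driver heap can hold while serializing them.
public class TaskSizeEstimate {
    public static void main(String[] args) {
        long taskBytes = 127_544_326L;              // serialized task size, from the log
        long heapBytes = 16L * 1024 * 1024 * 1024;  // 16 GB driver heap

        // ByteArrayOutputStream doubles its buffer as it grows (the
        // Arrays.copyOf frames in the stack trace), so serializing one task
        // can transiently need roughly 2x its final size.
        long perTaskPeak = 2 * taskBytes;

        // Rough upper bound on task buffers the driver heap can hold at once,
        // ignoring all other driver memory use.
        long maxConcurrent = heapBytes / perTaskPeak;
        System.out.println(maxConcurrent); // prints 67
    }
}
```

With 40 executors, each with multiple cores, the driver launches and buffers many such tasks in quick succession, so exhausting the heap is plausible even at 16 GB.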

Following the recommendations in this mailing-list thread: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3C49229E870391FC49BBBED818C268753D70587CCC@SZXEMA501-MBX.china.huawei.com%3E , I believe the error was caused by the aggregation method Spark used. I upgraded to Spark 1.1 and everything worked fine.
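If I understand the change correctly, Spark 1.1 introduced treeAggregate, and MLlib's L-BFGS switched to it: instead of the driver combining every partition's result itself, partial results are merged in levels, so no single node holds all of them at once. A toy, Spark-free illustration of the idea (the partial sums are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of tree aggregation (not Spark code): partial results are
// merged pairwise in levels rather than all at once on one node.
public class TreeAggregateSketch {
    static long treeAggregate(List<Long> partials) {
        List<Long> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                // Merge each adjacent pair; an odd element passes through.
                long merged = (i + 1 < level.size())
                        ? level.get(i) + level.get(i + 1)
                        : level.get(i);
                next.add(merged);
            }
            level = next; // each level halves the number of partial results
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        // Pretend these are per-partition gradient sums.
        List<Long> partials = List.of(1L, 2L, 3L, 4L, 5L);
        System.out.println(treeAggregate(partials)); // prints 15
    }
}
```

The final answer is the same as a flat aggregate; the win is that the node doing the last merge only ever sees two intermediate values instead of one per partition.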

