
PySpark - java.lang.OutOfMemoryError when running as a standalone application, but no error when running in Docker

I am getting Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: Java heap space when running a PySpark application in standalone mode, but everything runs fine when the same application runs in a Docker container.

I have a simple recommendation application that uses PySpark for faster processing. The dataset has 1m records.

When I run the application locally, I get a Java OutOfMemoryError, but when I containerise it and run the container locally, everything works fine. The standalone application and the Docker container are otherwise identical. The details are below.

Here is the relevant part of the Dockerfile:

RUN apt-get update && apt-get install -qq -y \
build-essential libpq-dev --no-install-recommends && \
apt-get install -y software-properties-common

RUN apt-get install -y openjdk-8-jre && \
apt-get install -y openjdk-8-jdk
RUN echo "JAVA_HOME=$(which java)" | tee -a /etc/environment

Here is the PySpark code:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext('local')
    sqlContext = SQLContext(sc)

    sc.setCheckpointDir('temp/')

    # user_posr_rate_df is built earlier in the application (not shown here)
    df = sqlContext.createDataFrame(user_posr_rate_df)
    sc.parallelize(df.collect())
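
As context for the warnings in the logs below: df.collect() pulls all 1m rows into the driver, and sc.parallelize(...) then ships that full list back out inside the scheduled tasks, which is why the TaskSetManager complains about tasks far above the recommended 100 KB and why task serialization can exhaust the driver heap. A minimal sketch of keeping the data distributed instead (this is an assumption about intent, not part of the original question or the accepted answer):

    # Hypothetical alternative: use the DataFrame's underlying RDD directly,
    # so the rows are never collected to the driver and re-serialized into tasks.
    rows_rdd = df.rdd
    print(rows_rdd.count())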

I expect the results when running as a standalone application to match the results when running in the Docker container. Below are the respective results.

Results when running in Docker:

 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
 19/08/16 11:54:26 WARN TaskSetManager: Stage 0 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
 19/08/16 11:54:35 WARN TaskSetManager: Stage 1 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
 19/08/16 11:54:37 WARN TaskSetManager: Stage 3 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
 19/08/16 11:54:40 WARN TaskSetManager: Stage 5 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
  19/08/16 11:54:41 WARN TaskSetManager: Stage 6 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
 19/08/16 11:54:42 WARN TaskSetManager: Stage 7 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.
 19/08/16 11:54:43 WARN TaskSetManager: Stage 8 contains a task of very large size (12230 KB). The maximum recommended task size is 100 KB.

Results when running locally as a standalone application:

 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
 19/08/16 17:50:20 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
 19/08/16 16:51:27 WARN TaskSetManager: Stage 0 contains a task of very large size (158329 KB). The maximum recommended task size is 100 KB.
 Exception in thread "dispatcher-event-loop-0" 
 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:486)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:467)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:326)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:321)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$12.apply(TaskSchedulerImpl.scala:423)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$12.apply(TaskSchedulerImpl.scala:420)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:420)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:407)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:407)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:86)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:64)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)

Adding config parameters to the SparkContext resolved my issue:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAll([('spark.executor.memory', '10g'),
                           ('spark.executor.cores', '3'),
                           ('spark.cores.max', '3'),
                           ('spark.driver.memory', '8g')])

sc = SparkContext(conf=conf)

Basically, I added a conf to the SparkContext.
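
For completeness, a sketch of how the answer's configuration slots into the question's script end to end. This stitches the snippets above together (imports added, the 'local' master taken from the question, and user_posr_rate_df assumed to be defined elsewhere); it is not a verbatim copy of the poster's code:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Memory and core settings from the accepted answer, master from the question.
    conf = SparkConf().setMaster('local').setAll([
        ('spark.executor.memory', '10g'),
        ('spark.executor.cores', '3'),
        ('spark.cores.max', '3'),
        ('spark.driver.memory', '8g'),
    ])

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    sc.setCheckpointDir('temp/')

    df = sqlContext.createDataFrame(user_posr_rate_df)  # user_posr_rate_df assumed to exist
    sc.parallelize(df.collect())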
