java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
I'm invoking PySpark with Spark 2.0 in local mode with the following command:
pyspark --executor-memory 4g --driver-memory 4g
The input dataframe is read from a TSV file and has 580K rows x 28 columns. I'm doing a few operations on the dataframe, and when I try to export it to a TSV file I get this error.
df.coalesce(1).write.save("sample.tsv", format="csv", header="true", delimiter="\t")
Any pointers on how to get rid of this error? I can easily display the df or count the rows.
The output dataframe is 3100 rows with 23 columns.
Error:
Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
Driver stacktrace:
I believe that the cause of this problem is coalesce(), which, despite the fact that it avoids a full shuffle (as repartition would do), still has to shrink the data into the requested number of partitions.

Here, you are requesting all the data to fit into one partition, so one task (and only one task) has to work with all the data, which may cause its container to suffer from memory limitations.
So, either ask for more partitions than 1, or avoid coalesce() in this case.
Otherwise, you could try the solutions provided in the links below for increasing your memory configuration:
The problem for me was indeed coalesce(). What I did was to export the file without coalesce(), writing parquet instead with df.write.parquet("testP"). Then I read the file back and exported it with coalesce(1). Hopefully it works for you as well.
In my case, replacing coalesce(1) with repartition(1) worked.
As was stated in other answers, use repartition(1) instead of coalesce(1). The reason is that repartition(1) ensures the upstream processing is done in parallel (multiple tasks/partitions), rather than on only one executor.
To quote the Dataset.coalesce() Spark docs:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(1) instead. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
In my case the driver was smaller than the workers. The issue was resolved by making the driver larger.