
java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0

I'm invoking PySpark with Spark 2.0 in local mode with the following command:

pyspark --executor-memory 4g --driver-memory 4g

The input dataframe is read from a TSV file and has 580K rows x 28 columns. I'm doing a few operations on the dataframe, and when I try to export it to a TSV file I get this error.

df.coalesce(1).write.save("sample.tsv", format="csv", header='true', delimiter='\t')

Any pointers on how to get rid of this error? I can easily display the df or count the rows.

The output dataframe is 3100 rows with 23 columns.

Error:

Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 1073, localhost): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:129)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.fetchNextRow(WindowExec.scala:300)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15$$anon$1.<init>(WindowExec.scala:309)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:289)
    at org.apache.spark.sql.execution.WindowExec$$anonfun$15.apply(WindowExec.scala:288)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    ... 8 more

Driver stacktrace:

I believe that the cause of this problem is coalesce(), which, despite the fact that it avoids a full shuffle (as repartition would do), still has to shrink the data into the requested number of partitions.

Here, you are requesting that all the data fit into one partition, so one task (and only one task) has to work with all the data, which may cause its container to run into memory limits.

So, either ask for more than one partition, or avoid coalesce() in this case. Both options are sketched below.
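
A minimal sketch of the two alternatives, assuming the df from the question (the output paths and the partition count of 4 are illustrative, not from the original post):

# Option A: skip coalesce() entirely; Spark writes one part-file per partition.
df.write.csv("sample_out", header=True, sep="\t")

# Option B: keep coalesce(), but ask for more than one partition so a single
# task never has to hold all of the data.
df.coalesce(4).write.csv("sample_out_coalesced", header=True, sep="\t")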


Otherwise, you could try the solutions provided in the links below to increase your memory configuration:

  1. Spark java.lang.OutOfMemoryError: Java heap space
  2. Spark runs out of memory when grouping by key

The problem for me was indeed coalesce(). What I did was export the file without coalesce(), writing it as Parquet instead with df.write.parquet("testP"). Then I read the file back and exported that with coalesce(1).

Hopefully it works for you as well.
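
A rough sketch of that workaround, assuming an active spark session and the df from the question ("testP" comes from the answer above; the final output path is a placeholder):

# 1. Write the result of the expensive computation as Parquet, one file per partition.
df.write.parquet("testP")

# 2. Read it back; the upstream window/sort work is no longer in the plan.
df_small = spark.read.parquet("testP")

# 3. Coalescing to one partition now only has to move the final 3100 rows.
df_small.coalesce(1).write.csv("sample_out", header=True, sep="\t")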

In my case, replacing coalesce(1) with repartition(1) worked.

As was stated in other answers, use repartition(1) instead of coalesce(1). The reason is that repartition(1) will ensure that upstream processing is done in parallel (multiple tasks/partitions), rather than on only one executor.

To quote the Dataset.coalesce() Spark docs:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(1) instead. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
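
Applied to the original write, a minimal sketch (the output path is illustrative):

# The shuffle added by repartition(1) lets the upstream stages run across many
# tasks; only the final, already-shuffled partition is written by a single task.
df.repartition(1).write.csv("sample_out", header=True, sep="\t")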

In my case the driver was smaller than the workers. The issue was resolved by making the driver larger.
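
For the local-mode invocation from the question, that would mean raising --driver-memory, for example (the 8g value is just an illustration; pick what your machine allows):

pyspark --driver-memory 8g --executor-memory 4g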
