
Why can I show() a dataframe using SparkSQL but cannot write it to json without getting "java.lang.OutOfMemoryError"?

I used SparkSQL to process data, and I want to write my data as a json file.

...
step12.show()
step12.repartition(10).coalesce(1).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')

step12 is my dataframe, but I got an error telling me java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0. That makes no sense to me, since I can show this dataframe. I use a cluster in Microsoft Azure with 16 GB of memory, my original data is about 1 GB, and step12 should not be more than 2 MB.

Why does this happen and how can I solve it?

17/04/16 14:46:34 WARN TaskSetManager: Lost task 0.0 in stage 43.0 (TID 3113, 10.0.0.6, executor 1): org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
        at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:127)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:154)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:121)
        at org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:82)
        at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:374)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:371)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:988)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:979)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:919)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:979)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:697)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
        ... 8 more

First of all, coalesce and repartition are very similar. It is awkward and unnecessary to do both.

Moving on, if you look at the documentation for coalesce:

" However, if you're doing a drastic coalesce, eg to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (eg one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). " 但是,如果您进行大量合并,例如将numPartitions = 1进行计算,则可能会导致您的计算在少于您希望的节点上进行(例如,在numPartitions = 1的情况下为一个节点)。为避免这种情况,您可以可以传递shuffle = true。这将添加一个shuffle步骤,但是意味着当前的上游分区将并行执行(无论当前分区是什么)。

You coalesced down to 1, so you can try setting the shuffle flag to true.
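
A rough sketch of what that might look like, in PySpark to match the question's code. The DataFrame coalesce does not expose a shuffle flag, so the usual equivalents are repartition(1), which always shuffles, or the RDD-level coalesce, which does take the shuffle flag from the quoted documentation:

# Sketch, assuming PySpark: repartition(1) always shuffles, so the upstream
# partitions are still processed in parallel before the final merge.
step12.repartition(1).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')
# The shuffle flag from the quoted docs lives on the RDD API:
# step12.rdd.coalesce(1, shuffle=True)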

But I think the most important thing is not to just try whatever, but to take the time to understand what the various operations do and how they work, so you can see what's really happening. For example, I have found that glom, which has a legitimate purpose "in real life," can also be really helpful when I want to see how things are partitioned in the console as I work my way up to scale.
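
A quick sketch of that kind of console check, assuming PySpark:

# glom() turns each partition into a list, so you can see how the rows are spread.
rows_per_partition = step12.rdd.glom().map(len).collect()
print(len(rows_per_partition), rows_per_partition)  # number of partitions and rows in each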

I think coalesce is creating the problem for you. coalesce avoids a full shuffle. If it is known that the number of partitions is decreasing, then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept. So it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

So in your case you are decreasing the number of partitions to 1, which is causing the memory issue. I think removing coalesce will solve the OutOfMemoryError.
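
As a sketch, keeping the question's repartition(10) and output path but dropping the coalesce(1):

# Without coalesce(1), the write runs as 10 parallel tasks instead of
# funnelling every row through a single partition.
step12.repartition(10).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')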

Calling repartition and coalesce on the same dataframe gives you poor performance and may cause OOM errors.

I would like you to check the number of partitions of the step12 dataframe before you apply repartition/coalesce, and check the rows in each partition, using the following commands.

step12.rdd.partitions.size // let's say 50 partitions
step12.rdd.mapPartitions(iter => Array(iter.size).iterator, true).collect()
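
If the session is PySpark, as the question's snippet suggests, a rough equivalent of the checks above is:

# PySpark sketch: partition count and the number of rows in each partition.
step12.rdd.getNumPartitions()
step12.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()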

If you see any partitions having very few or no rows, you can decrease the number of partitions accordingly using coalesce. coalesce always makes sure less shuffling happens, so that we get reasonable performance.

e.g.: out of 50 partitions, 40 partitions have empty or very few rows.

step12.coalesce(10).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')

This will create 10 output files.

Note: coalesce will not create output files of equal size.

If you want to create files of equal size, then go with repartition. But repartition will do more shuffling and gives poorer performance.
