
Why can I show() a DataFrame using Spark SQL but cannot write it to JSON ("java.lang.OutOfMemoryError")?

I used Spark SQL to process my data, and I want to write the result out as a JSON file.

...
step12.show()
step12.repartition(10).coalesce(1).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')

step12 is my DataFrame, but I got an error telling me java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0. This makes no sense to me, since I can show() this DataFrame. I am using a cluster in Microsoft Azure with 16 GB of memory, my original data is about 1 GB, and step12 should not exceed 2 MB.

Why does this happen, and how can I solve it?

17/04/16 14:46:34 WARN TaskSetManager: Lost task 0.0 in stage 43.0 (TID 3113, 10.0.0.6, executor 1): org.apache.spark.SparkException: Task failed while writing rows
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.OutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
        at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:127)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:154)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:121)
        at org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:82)
        at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:374)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:371)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:988)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:979)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:919)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:979)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:697)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
        ... 8 more

First of all, coalesce and repartition are very similar. It is awkward and unnecessary to do both.

Moving on, if you look at the documentation for coalesce:

" However, if you're doing a drastic coalesce, eg to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (eg one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). "

You coalesced down to 1, so you can try setting the shuffle flag to true.
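
For reference, a minimal PySpark sketch of what that could look like. Note that the shuffle flag lives on the RDD API; DataFrame.coalesce takes no such flag, so repartition(1), which always shuffles, is the usual DataFrame-level equivalent. The spark session variable is an assumption; the output path is the one from the question.

# Option 1: drop to the RDD API, where coalesce accepts a shuffle flag.
shuffled_rdd = step12.rdd.coalesce(1, shuffle=True)
spark.createDataFrame(shuffled_rdd, step12.schema) \
    .write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')

# Option 2: stay in the DataFrame API; repartition(1) always performs a shuffle.
step12.repartition(1).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')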

But I think the most important thing is not to just try things at random, but to take the time to understand what the various operations do and how they work, so you know what is really happening. For example, I have found that glom, which has a legitimate purpose "in real life," can also be really helpful when I want to see how things are partitioned in the console as I work my way up to scale.
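
As a rough sketch of that (assuming the step12 DataFrame from the question): glom() turns each partition into a list, so mapping len over it shows how many rows each partition holds.

# One list per partition; len() of each list is that partition's row count.
partition_sizes = step12.rdd.glom().map(len).collect()
print(partition_sizes)   # e.g. [1024, 987, 0, ...] -- one entry per partition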

I think coalesce is creating the problem for you. coalesce avoids a full shuffle: if the number of partitions is known to be decreasing, the executors can keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept. So it would go something like this (a small runnable sketch follows the example):

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

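Here is a small runnable sketch of that behaviour, using a toy RDD rather than the questioner's data. The spark session variable is an assumption, and the exact grouping can differ from the node layout above, since it depends on the Spark version and data locality.

# Distribute 1..12 across 4 partitions, then coalesce down to 2 without a shuffle.
rdd = spark.sparkContext.parallelize(range(1, 13), 4)
print(rdd.glom().collect())               # e.g. [[1,2,3], [4,5,6], [7,8,9], [10,11,12]]
print(rdd.coalesce(2).glom().collect())   # e.g. [[1,2,3,4,5,6], [7,8,9,10,11,12]]
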
So in your case you are decreasing the number of partitions to 1, which is causing the memory issue. I think removing coalesce will resolve the OutOfMemoryError.

Calling repartition and coalesce on the same DataFrame gives you poor performance and may cause OOM errors.

I would like you to check the number of partitions of the step12 DataFrame before you apply repartition/coalesce, and check the number of rows in each partition, using the following commands.

step12.rdd.getNumPartitions()   # let's say 50 partitions
step12.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()   # row count per partition

If you see that many partitions are empty or hold very few rows, you can decrease the number of partitions accordingly using coalesce. coalesce always tries to minimise shuffling, so you get reasonable performance.

Example: out of 50 partitions, 40 are empty or have very few rows.

step12.coalesce(10).write.json('wasb://liu@cliubo.blob.core.windows.net/test_data_4')

This will create 10 output files.

Note: coalesce will not create output files of equal size.

If you want to create files of roughly equal size, go with repartition instead. But repartition does a full shuffle, so it costs more in performance.
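
A quick way to see that difference is to compare the physical plans: repartition shows up as an Exchange (a full shuffle), while coalesce does not add one. A small sketch; the exact plan text varies by Spark version.

step12.repartition(10).explain()   # plan contains an Exchange (shuffle) step
step12.coalesce(10).explain()      # plan contains a Coalesce step, with no extra Exchange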
