Repartition() causes Spark job to fail

I have a Spark job that runs fine with the code below. However, this step creates several files in the output folder.

sampledataframe.write.mode('append').partitionBy('DATE_FIELD').save(FILEPATH)

So I started using the line of code below to repartition the data and end up with a single file per partition.

sampledataframe.repartition('DATE_FIELD').write.mode('append').partitionBy('DATE_FIELD').save(FILEPATH)
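For what it's worth, my understanding is that repartitioning on the partition column hashes every row with the same DATE_FIELD value into one shuffle partition, which is why a single file comes out per date. A rough sketch of how to see that partitioning:

# Sketch: the partition count stays at spark.sql.shuffle.partitions
# (200 by default); rows sharing a DATE_FIELD all land in one of them.
repartitioned = sampledataframe.repartition('DATE_FIELD')
print(repartitioned.rdd.getNumPartitions())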

This repartition step worked fine for several months but recently started failing with the following error.

[2019-09-26 16:15:48,030] {bash_operator.py:74} INFO - 19/09/26 16:15:48 WARN TaskSetManager: Lost task 48.0 in stage 1.0 (TID 812, aaa.bbb.io): org.apache.spark.SparkException: Task failed while writing rows
[2019-09-26 16:15:48,031] {bash_operator.py:74} INFO - at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:417)
[2019-09-26 16:15:48,031] {bash_operator.py:74} INFO - at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
[2019-09-26 16:15:48,031] {bash_operator.py:74} INFO - at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
[2019-09-26 16:15:48,031] {bash_operator.py:74} INFO - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
[2019-09-26 16:15:48,031] {bash_operator.py:74} INFO - at org.apache.spark.scheduler.Task.run(Task.scala:89)
[2019-09-26 16:15:48,032] {bash_operator.py:74} INFO - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
[2019-09-26 16:15:48,032] {bash_operator.py:74} INFO - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2019-09-26 16:15:48,032] {bash_operator.py:74} INFO - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2019-09-26 16:15:48,032] {bash_operator.py:74} INFO - at java.lang.Thread.run(Thread.java:748)
[2019-09-26 16:15:48,032] {bash_operator.py:74} INFO - Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

Has anyone encountered this error before? Can you please let me know how I can fix this?

I think this has something to do with memory allocation. You may recently have more data to process, which can cause problems like timeouts, skew, etc.

Is there any data skew in any of the tasks? Can you check? Also, please share your cluster configuration and your spark-submit memory parameters.
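As a minimal sketch (reusing the column name from the question), you could count rows per DATE_FIELD; one date that dwarfs the rest means its single post-repartition task carries most of the load:

# Sketch: rows per partition value. A heavily skewed date will show up
# as one count far above the others.
(sampledataframe
    .groupBy('DATE_FIELD')
    .count()
    .orderBy('count', ascending=False)
    .show(20))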

This primarily seems to be an issue where the executors do not get enough memory. Since you are trying to produce a single file per partition, each task needs enough memory to hold and write all of its partition's shuffled rows.
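As a rough sketch, executor memory can be raised when the session is built; the size below is a placeholder to tune against your cluster, and driver memory usually has to go on spark-submit itself (--driver-memory), before the JVM starts:

from pyspark.sql import SparkSession

# Sketch only: placeholder size, not a recommendation.
# Equivalent spark-submit flag: --executor-memory 8g
spark = (SparkSession.builder
    .config('spark.executor.memory', '8g')
    .getOrCreate())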

If the file gets too large, memory on the master node becomes the bottleneck.

Possible solutions would be:

  1. Check the resource usage on the master and increase its capacity if it seems to be over-utilized.
  2. A longer-term solution would be to update the dependent modules to read part files, which keeps the task scalable; then you can stop repartitioning and simply write part files (see the sketch below).
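A minimal sketch of option 2, assuming the downstream job also uses Spark (which reads the part files under a partitioned directory transparently) and a session handle named spark:

# Writer: drop the repartition and let each task emit its own part file.
sampledataframe.write.mode('append').partitionBy('DATE_FIELD').save(FILEPATH)

# Downstream reader: Spark stitches the part files back together.
downstream_df = spark.read.load(FILEPATH)
one_day = downstream_df.where(downstream_df['DATE_FIELD'] == '2019-09-26')  # example date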
