How to handle a large gz file in Spark

Question

I am trying to read a large gz file and then insert it into a table. This is taking very long.

sparkSession.read.format("csv").option("header", "true").load("file-about-5gb-size.gz").repartition(1000).coalesce(1000).write.mode("overwrite").format("orc").insertInto(table)

Is there any way to optimize this? Please help.

Note: the repartition and coalesce values were chosen arbitrarily.

Answer 1

You won't be able to optimize the read while the file is gzip-compressed. Gzip is not a splittable format in Spark, so the whole file is read by a single task on one executor no matter how large the cluster is; the repartition only takes effect after that single-threaded read has finished.
If you want to parallelize the read, you need to make the file splittable by decompressing it first and then processing it.
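
As a rough sketch of the "decompress first" step, assuming PySpark (the snippet in the question reads the same in PySpark and Scala) and assuming the decompressed file ends up somewhere every executor can read it: the helper names `decompress_gzip` and `load_into_table` and the chunk size are made up for illustration, not taken from the question.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress a .gz file so the whole file never sits in memory.

    Hypothetical helper; chunk_size is an arbitrary illustrative value.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


def load_into_table(spark, csv_path, table):
    """Read the uncompressed CSV and insert it into an existing table.

    Assumes `spark` is an active SparkSession and `table` already exists.
    An uncompressed CSV is splittable, so the read itself is spread over
    many tasks and no explicit repartition is needed just to parallelize it.
    """
    df = spark.read.format("csv").option("header", "true").load(csv_path)
    df.write.insertInto(table, overwrite=True)
```

The decompression is still a one-off serial pass, but after it the Spark read is split by input size instead of running as a single task. If the file lives on HDFS or object storage, the decompressed copy would have to be written back there rather than to a local disk.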
