I am trying to read large gz file and, then inserting into table. this is taking so long.
sparkSession.read.format("csv").option("header", "true").load("file-about-5gb-size.gz").repartition( 1000).coalesce(1000).write.mode("overwrite").format("orc").insertInto(table)
Is there any way I can optimize this, please help.
Note: I have used random repartition and coalesce
You won't be able to do read optimization if your file is in gzip compression. The gzip compression is not splittable in spark. There's no way to avoid reading the complete file in the spark driver node.
If you want to parallelize, you need to make this file splittable by unzip
it and then process it.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.