
How to handle large gz file in Spark

I am trying to read a large gz file and then insert it into a table, but this is taking very long.

sparkSession.read.format("csv").option("header", "true")
  .load("file-about-5gb-size.gz")
  .repartition(1000).coalesce(1000)
  .write.mode("overwrite").format("orc").insertInto(table)

Is there any way I can optimize this? Please help.

Note: the repartition and coalesce values were chosen arbitrarily.

You won't be able to optimize the read if your file is gzip-compressed. Gzip compression is not splittable in Spark, so there is no way to avoid reading the complete file in a single task.
If you want to parallelize, you need to make the file splittable by decompressing it first and then processing it.
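
For example, if the file is decompressed up front (say with gunzip, outside Spark), Spark can split the plain CSV into many input partitions and the downstream write runs in parallel. A minimal sketch, assuming the decompressed file is named file-about-5gb-size.csv and the target ORC table named "table" already exists (both names are placeholders taken from the question):

import org.apache.spark.sql.SparkSession

// Hypothetical session setup; in the original snippet sparkSession already exists.
val spark = SparkSession.builder()
  .appName("load-large-csv")
  .enableHiveSupport()
  .getOrCreate()

// Reading the uncompressed CSV lets Spark split it into many input partitions.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file-about-5gb-size.csv")   // decompressed copy of the original .gz

// The repartition count is arbitrary; tune it to your cluster.
df.repartition(100)
  .write
  .mode("overwrite")
  .insertInto("table")               // assumes the target ORC table already exists

Note that insertInto writes using the storage format of the existing table, so the format("orc") call in the original snippet has no effect there; repartition(1000).coalesce(1000) is also redundant, since the coalesce simply collapses the partitions the repartition just created.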
