
How to handle large gz file in Spark

I am trying to read a large gz file and then insert it into a table, but this is taking very long.

sparkSession.read.format("csv").option("header", "true")
  .load("file-about-5gb-size.gz")
  .repartition(1000).coalesce(1000)
  .write.mode("overwrite").format("orc").insertInto(table)

Is there any way I can optimize this? Please help.

Note: the repartition and coalesce values were chosen arbitrarily.

You won't be able to optimize the read if your file is gzip-compressed. Gzip compression is not splittable in Spark, so there is no way to avoid reading the complete file in a single task.
If you want to parallelize, you need to make the file splittable by decompressing it first and then processing it.
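
For example, if the file is decompressed up front (say with gunzip, outside Spark), Spark can split the plain CSV into many input partitions and the downstream write runs in parallel. A minimal sketch, assuming the decompressed file is named file-about-5gb-size.csv and the target ORC table named "table" already exists (both names are placeholders taken from the question):

import org.apache.spark.sql.SparkSession

// Hypothetical session setup; in the original snippet sparkSession already exists.
val spark = SparkSession.builder()
  .appName("load-large-csv")
  .enableHiveSupport()
  .getOrCreate()

// Reading the uncompressed CSV lets Spark split it into many input partitions.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file-about-5gb-size.csv")   // decompressed copy of the original .gz

// The repartition count is arbitrary; tune it to your cluster.
df.repartition(100)
  .write
  .mode("overwrite")
  .insertInto("table")               // assumes the target ORC table already exists

Note that insertInto writes using the storage format of the existing table, so the format("orc") call in the original snippet has no effect there; repartition(1000).coalesce(1000) is also redundant, since the coalesce simply collapses the partitions the repartition just created.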
