
Reading compressed file in Spark with Scala

I am trying to read the content of a .gz file into a dataframe/RDD in Spark/Scala using the following code:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // wholeTextFiles loads the whole file as a single (path, content) pair
    val data = sc.wholeTextFiles("path to gz file")
    data.collect().foreach(println)

The .gz file is 28 MB, and when I do the spark-submit using this command:

spark-submit --class sample --master local[*] target\spark.jar

it gives me a Java heap space error in the console.

Is this the best way of reading a .gz file, and if yes, how could I solve the Java heap error issue?


Thanks

Disclaimer: that code and description will purely read in a small compressed text file using Spark, collect it to an array of every line, and print every line in the entire file to the console. The number of ways and reasons to do this outside Spark far outnumber those to do it in Spark.

1) Use SparkSession instead of SparkContext if you can swing it; sparkSession.read.text() is the command to use (it automatically handles a few compression formats) -- see the sketch below.
2) Or at least use sc.textFile() instead of wholeTextFiles.
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're local, not network bound). Add the --driver-memory option to the spark shell to increase memory if you MUST do the collect.
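For illustration, here is a minimal sketch of options 1) and 2), assuming Spark 2.x running locally; the file path and app name are placeholders, not values from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ReadGzExample")      // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Option 1: read.text handles .gz transparently and returns a DataFrame
    // with one row per line of the file.
    val df = spark.read.text("path/to/file.gz")
    df.show(10, truncate = false)    // inspect a few lines instead of collecting everything

    // Option 2: textFile gives an RDD[String] with one element per line,
    // rather than wholeTextFiles' single (path, content) pair for the whole file.
    val rdd = spark.sparkContext.textFile("path/to/file.gz")
    rdd.take(10).foreach(println)

    spark.stop()

If the collect really is necessary, the driver heap can be raised at submit time, for example (4g is only an illustrative value):

    spark-submit --driver-memory 4g --class sample --master local[*] target\spark.jar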
