
Reading a compressed file in Spark with Scala

I am trying to read the contents of a .gz file into a DataFrame/RDD in Spark/Scala using the following code:

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.wholeTextFiles("path to gz file")
    data.collect().foreach(println)

The .gz file is 28 MB, and when I do the spark-submit using this command

spark-submit --class sample --master local[*] target\spark.jar

it gives me a Java heap space error in the console.

Is this the best way of reading a .gz file, and if so, how could I solve the Java heap space error?


Thanks

Disclaimer: that code and description will simply read a small compressed text file with Spark, collect it into an array of lines, and print every line of the file to the console. The number of ways (and reasons) to do this outside of Spark far outnumbers those to do it in Spark.

1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats); see the sketch after this list.

2) Or at least use sc.textFile() instead of wholeTextFiles.

3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running local, it isn't even network bound). Add the --driver-memory option to spark-submit to increase memory if you MUST do the collect.
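For illustration, here is a minimal sketch of points 1 and 2. The file path, object name, and app name are placeholders, not taken from the question:

    import org.apache.spark.sql.SparkSession

    object Sample {
      def main(args: Array[String]): Unit = {
        // SparkSession is the single entry point; it wraps SparkContext.
        val spark = SparkSession.builder().appName("read-gz").getOrCreate()

        // Point 1: read.text() decompresses .gz transparently and yields a
        // DataFrame with one row per line of the file.
        val df = spark.read.text("path/to/file.gz")
        df.show(20, truncate = false) // pulls only 20 rows to the driver

        // Point 2: the RDD alternative; textFile() gives one element per line
        // instead of one huge (path, wholeContent) pair like wholeTextFiles.
        val lines = spark.sparkContext.textFile("path/to/file.gz")
        println(lines.count())

        spark.stop()
      }
    }

For point 3, if the collect really is required, the submit command could look like the line below; the 4g value is just an example and should be sized to the uncompressed data:

    spark-submit --class sample --master local[*] --driver-memory 4g target\spark.jar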

