
How to handle a large text file in Spark?

I have a large text file (3 GB) that is a DNA reference. I would like to slice it into parts so that I can handle it.

So I want to know how to slice the file with Spark. I currently have only one node with 4 GB of memory.

Sounds like you want to load your file as multiple partitions. If your file is splittable (plain text, snappy, sequence file, etc.), you can simply specify the number of partitions it should be loaded with: sc.textFile(inputPath, numPartitions). If your file is not splittable, it will be loaded as a single partition, but you can call .repartition(numPartitions) on the loaded RDD to split it into multiple partitions.
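
A minimal sketch of both cases, assuming a spark-shell SparkContext named sc; the file paths and the partition count of 32 are hypothetical placeholders (roughly 3 GB / 32 ≈ 100 MB per partition, which should fit comfortably in 4 GB of memory):

// Splittable input: ask for several partitions up front (the argument is a minimum hint)
val parts = sc.textFile("/data/reference.fa", 32)
println(parts.getNumPartitions)

// Non-splittable input (e.g. a single gzip file): it loads as one partition, so repartition afterwards
val repart = sc.textFile("/data/reference.fa.gz").repartition(32)
println(repart.getNumPartitions)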

If you want a specific number of lines in every chunk, you can try this:

val rdd = sc.textFile(inputPath).zipWithIndex()   // pair each line with its line index
val rdd2 = rdd.filter { case (_, idx) => lowest_no_of_line <= idx && idx <= highest_no_of_line }
  .map(_._1)                      // drop the index, keep the line text
  .coalesce(1, shuffle = false)   // single partition => single output file
rdd2.saveAsTextFile(outputPath)

The saved text file will now contain only the lines between lowest_no_of_line and highest_no_of_line.
