
How to handle a large text file in Spark?

I have a large text file (3 GB) that is a DNA reference. I would like to slice it into parts so that I can handle it.

So I want to know how to slice the file with Spark. I currently have only one node with 4 GB of memory.

Sounds like you want to load your file as multiple partitions. If your file is splittable (plain text, snappy, sequence file, etc.), you can simply specify the number of partitions it should be loaded with: sc.textFile(inputPath, numPartitions). If your file is not splittable, it will be loaded as a single partition, but you can call .repartition(numPartitions) on the loaded RDD to split it into multiple partitions.
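
A minimal sketch of both cases, assuming a spark-shell SparkContext named sc; the file paths and the partition count of 32 are hypothetical placeholders (roughly 3 GB / 32 ≈ 100 MB per partition, which should fit comfortably in 4 GB of memory):

// Splittable input: ask for several partitions up front (the argument is a minimum hint)
val parts = sc.textFile("/data/reference.fa", 32)
println(parts.getNumPartitions)

// Non-splittable input (e.g. a single gzip file): it loads as one partition, so repartition afterwards
val repart = sc.textFile("/data/reference.fa.gz").repartition(32)
println(repart.getNumPartitions)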

If you want a specific number of lines in every chunk, you can try this:

val rdd = sc.textFile(inputPath).zipWithIndex()   // pair each line with its line index
val rdd2 = rdd.filter { case (_, idx) => lowest_no_of_line <= idx && idx <= highest_no_of_line }
  .map(_._1)                      // drop the index, keep the line text
  .coalesce(1, shuffle = false)   // single partition => single output file
rdd2.saveAsTextFile(outputPath)

The saved text file will now contain only the lines between lowest_no_of_line and highest_no_of_line.
