
How to handle large text file in spark?

I have a large text file (3 GB) and it is a DNA reference. I would like to slice it into parts so that I can handle it.

So I want to know how to slice the file with Spark. I currently have only one node with 4 GB of memory.

Sounds like you want to load your file as multiple partitions. If your file is splittable (text file, snappy, sequence file, etc.), you can simply pass the desired number of partitions when loading it: sc.textFile(inputPath, numPartitions). If your file is not splittable, it will be loaded as a single partition, but you can call .repartition(numPartitions) on the loaded RDD to redistribute it across multiple partitions.
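
A minimal sketch of both cases, assuming a SparkContext named sc is already available; the paths and the partition count are placeholders:

// Splittable input: ask Spark to spread it over numPartitions partitions at load time.
val numPartitions = 8
val reference = sc.textFile("/data/reference.txt", numPartitions)
// Non-splittable input (e.g. a gzip file): it loads as one partition, so repartition it afterwards.
val referenceGz = sc.textFile("/data/reference.txt.gz").repartition(numPartitions)
println(reference.getNumPartitions)   // verify how many partitions were actually created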

If you want a specific number of lines in each chunk, you can try this:

val rdd = sc.textFile(inputPath).zipWithIndex()   // pair each line with its line number (0-based)
val rdd2 = rdd.filter(x => lowest_no_of_line <= x._2 && x._2 <= highest_no_of_line)
  .map(x => x._1)
  .coalesce(1, shuffle = false)   // collapse the slice into a single partition/output file
rdd2.saveAsTextFile(outputpath)

Now your saved text file will contain the lines between lowest_no_of_line and highest_no_of_line.
