
How to handle large text file in spark?

I have a large text file (3 GB) and it is a DNA reference. I would like to slice it into parts so that I can handle it.

So I want to know how to slice the file with Spark. I currently have only one node with 4 GB of memory.

Sounds like you want to load your file as multiple partitions. If your file is splittable (text file, snappy, sequence file, etc.), you can simply pass the desired number of partitions when loading it: sc.textFile(inputPath, numPartitions). If your file is not splittable, it will be loaded as a single partition, but you can call .repartition(numPartitions) on the loaded RDD to redistribute it across multiple partitions.
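
A minimal sketch of both cases, assuming a SparkContext named sc is already available; the paths and the partition count are placeholders:

// Splittable input: ask Spark to spread it over numPartitions partitions at load time.
val numPartitions = 8
val reference = sc.textFile("/data/reference.txt", numPartitions)
// Non-splittable input (e.g. a gzip file): it loads as one partition, so repartition it afterwards.
val referenceGz = sc.textFile("/data/reference.txt.gz").repartition(numPartitions)
println(reference.getNumPartitions)   // verify how many partitions were actually created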

If you want a specific number of lines in each chunk, you can try this:

val rdd = sc.textFile(inputPath).zipWithIndex()   // pair each line with its line number (0-based)
val rdd2 = rdd.filter(x => lowest_no_of_line <= x._2 && x._2 <= highest_no_of_line)
  .map(x => x._1)
  .coalesce(1, shuffle = false)   // collapse the slice into a single partition/output file
rdd2.saveAsTextFile(outputpath)

Now your saved text file will contain the lines between lowest_no_of_line and highest_no_of_line.
