Spark: getting line numbers with zipWithIndex and wholeTextFiles
I have a use case where I have to read files using wholeTextFiles. However, I need to produce line numbers within each file. If I use:
val file=sc.wholeTextFiles("path").zipWithIndex
I get one line number per file. How do I get a line number per line for each file?
One simple approach would be to flatten the loaded RDD using flatMap with a function that adds line numbers row-wise for each of the text files, as shown in the following:
// wholeTextFiles yields (fileName, fileContents) pairs; split each file's
// contents into lines and number them per file with zipWithIndex
val rdd = sc.wholeTextFiles("/path/to/textfiles").
  flatMap{ case (fName, lines) =>
    lines.split("\\n").zipWithIndex.map{ case (line, idx) => (fName, idx, line) }
  }
// rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ...
Collect-ing the RDD should result in something like below:
rdd.collect
// res1: Array[(String, Int, String)] = Array(
// ("/path/to/file1", 0, "text line 1 in file1"),
// ("/path/to/file1", 1, "text line 2 in file1"),
// ("/path/to/file1", 2, "text line 3 in file1"),
// ...
// ("/path/to/file2", 0, "text line 1 in file2"),
// ("/path/to/file2", 1, "text line 2 in file2"),
// ...
// ...
// )
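If a DataFrame is preferred over a raw RDD of tuples, the same result can be converted with toDF. A minimal sketch, assuming a SparkSession is available as spark (as in spark-shell); the column names fileName, lineNumber, and lineText are illustrative, not anything the original code defines:

import spark.implicits._  // enables the rdd.toDF(...) conversion

// name the three tuple fields so downstream code can use column expressions
val df = rdd.toDF("fileName", "lineNumber", "lineText")
df.show(truncate = false)

From here the per-file line numbers can be filtered or joined with ordinary DataFrame operations, e.g. df.filter($"lineNumber" === 0) to get the first line of every file.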