Spark: getting line numbers with zipWithIndex and wholeTextFiles

I have a use case where I have to read the files using wholeTextFiles. However, I need to produce line numbers in the file. If I use:

val file = sc.wholeTextFiles("path").zipWithIndex

I get one line number per file. How do I get a line number per line for each file?
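
For context, wholeTextFiles returns one (path, contents) pair per file, so zipWithIndex attaches a single index to each whole file rather than to each line. A minimal illustration of the resulting shape (the variable name, paths, and contents are hypothetical):

val byFile = sc.wholeTextFiles("path").zipWithIndex
// byFile: org.apache.spark.rdd.RDD[((String, String), Long)] = ...
// byFile.collect would yield one index per file, e.g.:
//   (("/path/to/file1", "text line 1 in file1\ntext line 2 in file1\n..."), 0)
//   (("/path/to/file2", "text line 1 in file2\n..."), 1)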

One simple approach would be to flatten the loaded RDD using flatMap with a function that adds line numbers row-wise for each of the text files, as shown in the following:

val rdd = sc.wholeTextFiles("/path/to/textfiles").
  flatMap{ case (fName, lines) =>
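    // split each file's contents into lines and index each line within its own file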
    lines.split("\\n").zipWithIndex.map{ case (line, idx) => (fName, idx, line) }
  }
// rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ...
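
Note that split takes a regular expression, so if the input files may have Windows-style line endings, splitting on "\\r?\\n" instead of "\\n" avoids leaving a trailing carriage return on each line.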

Collecting the RDD should result in something like the following:

rdd.collect
// res1: Array[(String, Int, String)] = Array(
//   ("/path/to/file1", 0, "text line 1 in file1"),
//   ("/path/to/file1", 1, "text line 2 in file1"),
//   ("/path/to/file1", 2, "text line 3 in file1"),
//       ...
//   ("/path/to/file2", 0, "text line 1 in file2"),
//   ("/path/to/file2", 1, "text line 2 in file2"),
//       ...
//       ...
// )
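
If downstream processing is easier with the DataFrame API, the (fileName, lineNumber, line) tuples convert directly. A minimal sketch, assuming a SparkSession is in scope as spark (as it is in spark-shell); the column names here are arbitrary:

import spark.implicits._

val df = rdd.toDF("fileName", "lineNumber", "line")
// df: org.apache.spark.sql.DataFrame = [fileName: string, lineNumber: int ... 1 more field]
df.show(3, truncate = false)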
