Spark: getting line numbers with zipWithIndex and wholeTextFiles

I have a use case where I have to read the files using wholeTextFiles. However, I need to produce line numbers in the file. If I use:

val file = sc.wholeTextFiles("path").zipWithIndex

I get one line number per file. How do I get a line number per line for each file?
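
For context, wholeTextFiles returns one (path, contents) pair per file, so zipWithIndex attaches a single index to each whole file rather than to each line. A minimal illustration of the resulting shape (the variable name, paths, and contents are hypothetical):

val byFile = sc.wholeTextFiles("path").zipWithIndex
// byFile: org.apache.spark.rdd.RDD[((String, String), Long)] = ...
// byFile.collect would yield one index per file, e.g.:
//   (("/path/to/file1", "text line 1 in file1\ntext line 2 in file1\n..."), 0)
//   (("/path/to/file2", "text line 1 in file2\n..."), 1)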

One simple approach would be to flatten the loaded RDD using flatMap with a function that adds line numbers row-wise for each of the text files, as shown in the following:

val rdd = sc.wholeTextFiles("/path/to/textfiles").
  flatMap{ case (fName, lines) =>
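    // split each file's contents into lines and index each line within its own file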
    lines.split("\\n").zipWithIndex.map{ case (line, idx) => (fName, idx, line) }
  }
// rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ...
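
Note that split takes a regular expression, so if the input files may have Windows-style line endings, splitting on "\\r?\\n" instead of "\\n" avoids leaving a trailing carriage return on each line.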

Collecting the RDD should result in something like the following:

rdd.collect
// res1: Array[(String, Int, String)] = Array(
//   ("/path/to/file1", 0, "text line 1 in file1"),
//   ("/path/to/file1", 1, "text line 2 in file1"),
//   ("/path/to/file1", 2, "text line 3 in file1"),
//       ...
//   ("/path/to/file2", 0, "text line 1 in file2"),
//   ("/path/to/file2", 1, "text line 2 in file2"),
//       ...
//       ...
// )
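
If downstream processing is easier with the DataFrame API, the (fileName, lineNumber, line) tuples convert directly. A minimal sketch, assuming a SparkSession is in scope as spark (as it is in spark-shell); the column names here are arbitrary:

import spark.implicits._

val df = rdd.toDF("fileName", "lineNumber", "line")
// df: org.apache.spark.sql.DataFrame = [fileName: string, lineNumber: int ... 1 more field]
df.show(3, truncate = false)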
