
Spark getting line number zipWithIndex with wholeTextFiles

I have a use case where I have to read files using wholeTextFiles, but I also need the line number of each line within its file. If I use:

val file = sc.wholeTextFiles("path").zipWithIndex

I get only one index per file (zipWithIndex numbers the RDD's elements, and each element here is a whole file). How do I get a line number per line for each file?

One simple approach would be to flatten the loaded RDD using flatMap with a function that adds line numbers row-wise for each of the text files, as shown in the following:

val rdd = sc.wholeTextFiles("/path/to/textfiles").
  flatMap { case (fName, contents) =>
    // Each element is (fileName, entire file contents); split the contents
    // into lines and number them per file with zipWithIndex.
    contents.split("\\n").zipWithIndex.map { case (line, idx) => (fName, idx, line) }
  }
// rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ...

Calling collect on the RDD should produce something like the following:

rdd.collect
// res1: Array[(String, Int, String)] = Array(
//   ("/path/to/file1", 0, "text line 1 in file1"),
//   ("/path/to/file1", 1, "text line 2 in file1"),
//   ("/path/to/file1", 2, "text line 3 in file1"),
//       ...
//   ("/path/to/file2", 0, "text line 1 in file2"),
//   ("/path/to/file2", 1, "text line 2 in file2"),
//       ...
//       ...
// )
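The per-file line numbering above does not depend on Spark itself: each (fileName, contents) pair is flattened into (fileName, lineIndex, line) triples. A minimal plain-Scala sketch of that logic (the object and method names here are illustrative, not part of the original answer):

```scala
object LineIndexDemo {
  // Mirrors the flatMap in the answer: split each file's contents into
  // lines and pair every line with its zero-based index within that file.
  def indexLines(files: Seq[(String, String)]): Seq[(String, Int, String)] =
    files.flatMap { case (fName, contents) =>
      contents.split("\\n").zipWithIndex.map { case (line, idx) => (fName, idx, line) }
    }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "/path/to/file1" -> "text line 1 in file1\ntext line 2 in file1",
      "/path/to/file2" -> "text line 1 in file2"
    )
    indexLines(files).foreach(println)
    // (/path/to/file1,0,text line 1 in file1)
    // (/path/to/file1,1,text line 2 in file1)
    // (/path/to/file2,0,text line 1 in file2)
  }
}
```

Note that the index restarts at 0 for every file, which is exactly what distinguishes this from calling zipWithIndex on the wholeTextFiles RDD directly.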
