
Spark getting line number zipWithIndex with wholeTextFiles

I have a use case where I have to read files using wholeTextFiles, but I also need the line number of each line within its file. If I use:

val file = sc.wholeTextFiles("path").zipWithIndex

I get only one index per file (zipWithIndex numbers the RDD's elements, and each element here is a whole file). How do I get a line number per line for each file?

One simple approach would be to flatten the loaded RDD using flatMap with a function that adds line numbers row-wise for each of the text files, as shown in the following:

val rdd = sc.wholeTextFiles("/path/to/textfiles").
  flatMap { case (fName, contents) =>
    // Each element is (fileName, entire file contents); split the contents
    // into lines and number them per file with zipWithIndex.
    contents.split("\\n").zipWithIndex.map { case (line, idx) => (fName, idx, line) }
  }
// rdd: org.apache.spark.rdd.RDD[(String, Int, String)] = ...

Calling collect on the RDD should produce something like the following:

rdd.collect
// res1: Array[(String, Int, String)] = Array(
//   ("/path/to/file1", 0, "text line 1 in file1"),
//   ("/path/to/file1", 1, "text line 2 in file1"),
//   ("/path/to/file1", 2, "text line 3 in file1"),
//       ...
//   ("/path/to/file2", 0, "text line 1 in file2"),
//   ("/path/to/file2", 1, "text line 2 in file2"),
//       ...
//       ...
// )
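The per-file line numbering above does not depend on Spark itself: each (fileName, contents) pair is flattened into (fileName, lineIndex, line) triples. A minimal plain-Scala sketch of that logic (the object and method names here are illustrative, not part of the original answer):

```scala
object LineIndexDemo {
  // Mirrors the flatMap in the answer: split each file's contents into
  // lines and pair every line with its zero-based index within that file.
  def indexLines(files: Seq[(String, String)]): Seq[(String, Int, String)] =
    files.flatMap { case (fName, contents) =>
      contents.split("\\n").zipWithIndex.map { case (line, idx) => (fName, idx, line) }
    }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "/path/to/file1" -> "text line 1 in file1\ntext line 2 in file1",
      "/path/to/file2" -> "text line 1 in file2"
    )
    indexLines(files).foreach(println)
    // (/path/to/file1,0,text line 1 in file1)
    // (/path/to/file1,1,text line 2 in file1)
    // (/path/to/file2,0,text line 1 in file2)
  }
}
```

Note that the index restarts at 0 for every file, which is exactly what distinguishes this from calling zipWithIndex on the wholeTextFiles RDD directly.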
