
Reading a text file from HDFS in Scala/Spark

I am using Scala and Spark and want to read in an XML file as a single string. I'm struggling to find a neat Scala-ish way to do this.

My first thought was to use

val fileContents: RDD[String] = sparkContext.textFile(pathToFile)
val combinedContents: String = fileContents.reduce((line1, line2) => line1 + line2)

But I am concerned about whether this maintains the order of the lines, which is important for preserving the integrity of the XML contained within the string.

Other approaches I have found online for reading files from HDFS involve deprecated methods, so I want to avoid those. Any ideas?

sc.textFile returns an RDD whose lines keep the order they have in the file. Note that if you have multiple files in the provided path, the files will also be assigned to partitions in alphabetical order (of the file name). So, in conclusion, sc.textFile preserves the order of the lines.

As far as I can tell from reviewing the implementation of the collect() method, order is also kept there, so there is no reason not to use it directly:

sc.textFile(pathToFile).collect()

This should work.
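Note that collect() returns an Array[String], so one more step is needed to end up with a single string. A minimal sketch, assuming sc and pathToFile are as in the question:

// Lines arrive on the driver in order; mkString("\n") rejoins them
// into one document (textFile strips the newline terminators).
val xmlString: String = sc.textFile(pathToFile).collect().mkString("\n")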

However, if you want to be prepared for a different implementation of collect() (the documentation does not guarantee that it preserves order), the solution I would propose is the RDD method zipWithIndex, which is philosophically equivalent to the Scala method of the same name.

So I would do something like this:

sc.textFile(pathToFile).zipWithIndex().collect().sortBy(_._2).map(_._1)
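As before, this yields an Array[String] rather than a single string; appending a mkString completes it. A sketch under the same assumptions:

val xmlString: String = sc.textFile(pathToFile)
  .zipWithIndex()   // pair each line with its global index: RDD[(String, Long)]
  .collect()
  .sortBy(_._2)     // sort by line index on the driver to make ordering explicit
  .map(_._1)
  .mkString("\n")   // rejoin the lines into one string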

Options:

  1. Read the whole file with:

sparkContext.wholeTextFiles(filePath)

But this looks like overhead if you don't have many such files (see the sketch for this option after the list).

  2. Get an HDFS FileSystem object and read the file as an InputStream (see the sketch for this option after the list). A lot of examples are available: HDFS FileSystems API example
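For option 1, a minimal sketch, assuming sparkContext and filePath are as above; wholeTextFiles yields (path, content) pairs, so each file's content arrives as one intact string:

// Keep the content half of each (path, content) pair and take
// the single matching file.
val xmlString: String = sparkContext.wholeTextFiles(filePath)
  .map { case (_, content) => content }
  .first()

For option 2, a sketch using the non-deprecated Hadoop FileSystem API directly, bypassing Spark entirely; the path below is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Open the file through the Hadoop FileSystem API and drain
// the resulting InputStream into a single string.
val fs = FileSystem.get(new Configuration())
val in = fs.open(new Path("/path/to/file.xml"))
val xmlString: String =
  try Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()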
