
Reading a text file from HDFS in Scala/Spark

I am using Scala and Spark and want to read in an XML file as a single string. I'm struggling to find a neat Scala-ish way to do this.

My first thought was to use

val fileContents: RDD[String] = sparkContext.textFile(pathToFile)
val combinedContents: String = fileContents.reduce((line1, line2) => line1 + line2)

But I am concerned about whether this maintains the order of the lines, which is important for preserving the integrity of the XML contained within the string.

Other approaches I have found online for reading files from HDFS involve deprecated methods, so I want to avoid those. Any ideas?

sc.textFile returns an RDD whose lines keep the order they have in the file. Note that if you have multiple files in the provided path, the files will also be assigned to partitions in alphabetical order (of the file name). So, in conclusion, sc.textFile preserves the order of the lines.

As far as I can tell from reviewing the implementation of the collect() method, order is also kept there, so there is no reason not to use it directly:

sc.textFile(pathToFile).collect()

This should work.
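Note that collect() returns an Array[String], so one more step is needed to end up with a single string. A minimal sketch, assuming sc and pathToFile are as in the question:

// Lines arrive on the driver in order; mkString("\n") rejoins them
// into one document (textFile strips the newline terminators).
val xmlString: String = sc.textFile(pathToFile).collect().mkString("\n")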

However, if you want to be prepared for a different implementation of collect() (the documentation does not guarantee that it preserves order), the solution I would propose is the RDD method zipWithIndex, which is philosophically equivalent to the Scala method of the same name.

So I would do something like this:

sc.textFile(pathToFile).zipWithIndex().collect().sortBy(_._2).map(_._1)
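As before, this yields an Array[String] rather than a single string; appending a mkString completes it. A sketch under the same assumptions:

val xmlString: String = sc.textFile(pathToFile)
  .zipWithIndex()   // pair each line with its global index: RDD[(String, Long)]
  .collect()
  .sortBy(_._2)     // sort by line index on the driver to make ordering explicit
  .map(_._1)
  .mkString("\n")   // rejoin the lines into one string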

Options:

  1. Read the whole file with:

sparkContext.wholeTextFiles(filePath)

But this looks like overhead if you don't have many such files (see the sketch for this option after the list).

  2. Get an HDFS FileSystem object and read the file as an InputStream (see the sketch for this option after the list). A lot of examples are available: HDFS FileSystems API example
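For option 1, a minimal sketch, assuming sparkContext and filePath are as above; wholeTextFiles yields (path, content) pairs, so each file's content arrives as one intact string:

// Keep the content half of each (path, content) pair and take
// the single matching file.
val xmlString: String = sparkContext.wholeTextFiles(filePath)
  .map { case (_, content) => content }
  .first()

For option 2, a sketch using the non-deprecated Hadoop FileSystem API directly, bypassing Spark entirely; the path below is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Open the file through the Hadoop FileSystem API and drain
// the resulting InputStream into a single string.
val fs = FileSystem.get(new Configuration())
val in = fs.open(new Path("/path/to/file.xml"))
val xmlString: String =
  try Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()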
