
Apache Spark textFile to a String

val test = sc.textFile(logFile, 12).cache()

In the above code snippet, I am trying to make Apache Spark parallelize reading a huge text file. How do I store its contents in a String?

I was previously reading it like this:

val lines = scala.io.Source.fromFile(logFile, "utf-8").getLines.mkString

but now I am trying to make the read faster using the Spark context.

Reading the file into a String through Spark is very unlikely to be faster than reading it directly. To work efficiently in Spark you should keep everything in RDD form and do your processing that way, only reducing down to a (small) value at the end. Reading it in Spark just means you'll read it into memory locally, serialize the chunks and send them out to your cluster nodes, then serialize them again to send them back to your local machine and gather them together. Spark is a powerful tool, but it's not magical: it can only parallelize operations that are actually parallel. (Do you even know that reading the file into memory is the bottleneck? Always benchmark before optimizing.)
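For contrast, a minimal sketch of the RDD-style workflow described above, assuming a live SparkContext named `sc` and the question's `logFile` path; the `"ERROR"` filter is just a hypothetical stand-in for whatever per-line processing you actually need:

```scala
// Assumes `sc: SparkContext` and `logFile: String` already exist,
// as in the question. Keep the data as an RDD and reduce to a
// small value on the cluster instead of collecting the whole file.
val lines = sc.textFile(logFile)       // RDD[String], one element per line

val errorCount = lines
  .filter(_.contains("ERROR"))         // runs in parallel on the executors
  .count()                             // only a single Long travels back
```

Here only the final count crosses the network, instead of the entire file being shipped out and back.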

But to answer your question, you could use

test.toLocalIterator.mkString

Just don't expect it to be any faster than reading the file locally.

Collect the values, and then iterate over them:

  var string = ""
  test.collect().foreach { i => string += i }
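One caveat worth noting: `sc.textFile` splits the file into lines and drops the newline characters, so concatenating the collected elements directly loses the original line breaks. A sketch of a more idiomatic join, using a plain Array as a stand-in for what `test.collect()` would return:

```scala
// A plain Array[String] standing in for the result of test.collect():
// each element is one line of the file, without its trailing newline.
val collected = Array("first line", "second line", "third line")

// mkString with an explicit separator restores the line breaks
// that textFile stripped, and avoids the mutable accumulator.
val string = collected.mkString("\n")

println(string)
```

With the real RDD this would be `test.collect().mkString("\n")`, which still pulls the whole file onto the driver, so it only makes sense for files that fit in local memory anyway.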
