I am reading a compressed file in Spark using
val data = sc.textFile(inputFile)
This gives me data as an RDD[String]. How do I convert this to RDD[(Array[Byte], Array[Byte])] in Scala?
More details about this requirement:
I am using TeraSort on Spark. By default, TeraSort does not write compressed output to HDFS. To fix that issue, I added the following code to the TeraSort.scala file:
sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
This gives me a compressed output file. Now I need to read this file back in order to run TeraValidate, which expects its input in RDD[(Array[Byte], Array[Byte])] format.
Thanks
You can write the data compressed by passing a compression codec argument to saveAsTextFile, as follows:
import org.apache.hadoop.io.compress.GzipCodec
sorted.saveAsTextFile("/tmp/test/", classOf[GzipCodec])
You can read this compressed data back with sc.textFile("/tmp/test/"); Spark detects the codec from the file extension and decompresses automatically.
Now, to answer your real question: as Zohar said, you can use .map to transform each String into an Array[Byte], but you haven't given enough information about the record format.
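As a rough sketch, assuming the standard TeraSort record layout (a 10-byte key followed by the rest of the record as the value; adjust the split point if your records differ), the conversion could look like this. Note that sc.textFile decodes bytes as text and splits on newlines, so truly binary records may be corrupted on the way through; a binary-aware input format would be safer if your data can contain arbitrary bytes.

```scala
// Sketch: convert one line back into a (key, value) byte-array pair.
// Assumes a 10-byte key prefix (TeraSort convention) -- an assumption,
// not something stated in the question.
def toByteArrays(line: String): (Array[Byte], Array[Byte]) = {
  // ISO-8859-1 maps each char to one byte, avoiding UTF-8 re-encoding surprises
  val bytes  = line.getBytes("ISO-8859-1")
  val keyLen = math.min(10, bytes.length)
  (bytes.take(keyLen), bytes.drop(keyLen))
}

// Applied to the RDD from sc.textFile:
// val pairs: org.apache.spark.rdd.RDD[(Array[Byte], Array[Byte])] =
//   data.map(toByteArrays)
```

The Spark part is just data.map(toByteArrays); the helper is pure Scala, so you can unit-test the split logic without a SparkContext.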