
Spark: how to convert rdd.RDD[String] to rdd.RDD[(Array[Byte], Array[Byte])]

I am reading a compressed file in Spark using

val data =  sc.textFile(inputFile)

This gives me the data as an RDD[String]. How do I convert this to RDD[(Array[Byte], Array[Byte])] in Scala?

More details about this requirement:

I am using TeraSort on Spark. By default, TeraSort does not write compressed output to HDFS. To fix that, I added the following code to the TeraSort.scala file:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])

This gives me a compressed output file.

Now I need to read this file back to run TeraValidate. TeraValidate expects its input in RDD[(Array[Byte], Array[Byte])] format.

Thanks

You can write the data compressed by passing a compression codec argument to saveAsTextFile, as follows:

import org.apache.hadoop.io.compress.GzipCodec
sorted.saveAsTextFile("/tmp/test/", classOf[GzipCodec])

You can read this compressed data back using sc.textFile("/tmp/test/"); Spark detects the codec from the file extension and decompresses the data transparently.
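For instance, from the spark-shell (reusing the /tmp/test/ path from the example above):

val readBack = sc.textFile("/tmp/test/")
readBack.take(5).foreach(println)   // prints a few decompressed lines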


Now, to answer your real question: as Zohar said, you can use .map to transform your String records into pairs of Array[Byte], but you haven't given enough information about your record format for a definitive answer.
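For example, if each line was written by TextOutputFormat and therefore has the form key<TAB>value, a minimal sketch of the conversion could look like the following. The tab separator and the UTF-8 charset are assumptions about your data, and sc and inputFile are taken from your question:

import org.apache.spark.rdd.RDD

// Assumes each line is "key<TAB>value", as written by TextOutputFormat;
// adjust the separator and charset to match your actual records.
val lines = sc.textFile(inputFile)
val pairs: RDD[(Array[Byte], Array[Byte])] = lines.map { line =>
  val tab = line.indexOf('\t')
  val (key, value) =
    if (tab >= 0) (line.take(tab), line.drop(tab + 1))
    else (line, "")   // no tab found: treat the whole line as the key
  (key.getBytes("UTF-8"), value.getBytes("UTF-8"))
}

One caveat: if the original TeraSort keys contain arbitrary binary bytes, round-tripping them through Text and textFile may not preserve them byte-for-byte, since Text assumes UTF-8 content.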
