
Spark: how to convert rdd.RDD[String] to rdd.RDD[(Array[Byte], Array[Byte])]

I am reading a compressed file in Spark using

val data =  sc.textFile(inputFile)

This gives me the data as an RDD[String]. How do I convert this to RDD[(Array[Byte], Array[Byte])] in Scala?

More details about this requirement:

I am using TeraSort on Spark. By default, TeraSort does not write compressed output to HDFS. To fix that, I added the following code to the TeraSort.scala file:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])

This gives me a compressed output file.

Now I need to read this file back to run TeraValidate. TeraValidate expects its input in RDD[(Array[Byte], Array[Byte])] format.

Thanks

You can write the data compressed by passing a compression codec argument to saveAsTextFile, as follows:

import org.apache.hadoop.io.compress.GzipCodec
sorted.saveAsTextFile("/tmp/test/", classOf[GzipCodec])

You can read this compressed data back using sc.textFile("/tmp/test/"); Spark detects the codec from the file extension and decompresses the data transparently.
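For instance, from the spark-shell (reusing the /tmp/test/ path from the example above):

val readBack = sc.textFile("/tmp/test/")
readBack.take(5).foreach(println)   // prints a few decompressed lines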


Now, to answer your real question: as Zohar said, you can use .map to transform your String records into pairs of Array[Byte], but you haven't given enough information about your record format for a definitive answer.
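For example, if each line was written by TextOutputFormat and therefore has the form key<TAB>value, a minimal sketch of the conversion could look like the following. The tab separator and the UTF-8 charset are assumptions about your data, and sc and inputFile are taken from your question:

import org.apache.spark.rdd.RDD

// Assumes each line is "key<TAB>value", as written by TextOutputFormat;
// adjust the separator and charset to match your actual records.
val lines = sc.textFile(inputFile)
val pairs: RDD[(Array[Byte], Array[Byte])] = lines.map { line =>
  val tab = line.indexOf('\t')
  val (key, value) =
    if (tab >= 0) (line.take(tab), line.drop(tab + 1))
    else (line, "")   // no tab found: treat the whole line as the key
  (key.getBytes("UTF-8"), value.getBytes("UTF-8"))
}

One caveat: if the original TeraSort keys contain arbitrary binary bytes, round-tripping them through Text and textFile may not preserve them byte-for-byte, since Text assumes UTF-8 content.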
