
How to save as object files to different directories in Spark?

I have an RDD like the one below:

Array(
(0, "xx"),
(1, "xx"),
(2, "xx"),
(1, "yy")
)

I want to save it to different directories by key. For example, to create 3 files in those directories:

0/part-00000 // xx
1/part-00000 // xx and yy
2/part-00000 // xx

Through saveAsHadoopFile and MultipleTextOutputFormat, I can do it in text format. However, this RDD contains huge, complex data, so saving it in a compressed format may be better, like what saveAsObjectFile does.

MultipleSequenceFileOutputFormat may help me achieve this, but how do I use it correctly?
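One possible shape for this, sketched under the assumption that the old Hadoop API (saveAsHadoopFile) is used and that keys and values have already been converted to Writables; KeyBasedSequenceOutput and outputPath are illustrative names, not part of any Spark or Hadoop API:

```scala
import org.apache.hadoop.io.{BytesWritable, IntWritable, NullWritable}
import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat

// Illustrative subclass: routes each record into a subdirectory named
// after its key, writing (NullWritable, BytesWritable) SequenceFile records.
class KeyBasedSequenceOutput extends MultipleSequenceFileOutputFormat[Any, Any] {

  // The key only selects the directory; it is not written to the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // E.g. key 1 with the default name "part-00000" becomes "1/part-00000".
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[IntWritable].get + "/" + name
}

// rdd: RDD[(IntWritable, BytesWritable)] -- values pre-serialized to bytes.
rdd.saveAsHadoopFile(outputPath, classOf[IntWritable], classOf[BytesWritable],
  classOf[KeyBasedSequenceOutput])
```

Since SequenceFile values must be Writable, complex objects need to be serialized to something like BytesWritable first, which is roughly what saveAsObjectFile does internally.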


EDIT:

I have tried this to do it in text format:

class MultiOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  // Drop the key from the lines actually written out.
  override def generateActualKey(key: Any, value: Any): Any = {
    NullWritable.get()
  }

  // Prefix the default file name ("part-00000", ...) with the key as a directory.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
    key.asInstanceOf[Int] + "/" + super.generateFileNameForKeyValue(key, value, name)
  }
}

.saveAsHadoopFile(outputPath, classOf[Any], classOf[Any], classOf[MultiOutputFormat])
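To adapt this attempt from text to sequence files, complex values can first be Java-serialized into BytesWritable, which is roughly the representation saveAsObjectFile itself writes. A minimal sketch; toBytesWritable is a hypothetical helper, not an existing Spark function:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, IntWritable}

// Hypothetical helper: Java-serialize any value into a BytesWritable.
def toBytesWritable(value: AnyRef): BytesWritable = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(value)
  oos.close()
  new BytesWritable(bos.toByteArray)
}

// Prepare the RDD so that both key and value are Hadoop Writables,
// ready to be handed to a MultipleSequenceFileOutputFormat subclass.
val writableRdd = rdd.map { case (k, v) => (new IntWritable(k), toBytesWritable(v)) }
```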

What version of Spark are you using?

Have you tried something like

.repartition(3).saveAsTextFile("/path/to/output", classOf[GzipCodec])

or

sc.hadoopConfiguration.setClass(FileOutputFormat.COMPRESS_CODEC, classOf[GzipCodec], classOf[CompressionCodec])

?
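If the custom output format from the question is kept, saveAsHadoopFile also has an overload that takes a compression codec class directly, so per-key directories and compression can be combined. A sketch reusing the question's MultiOutputFormat and outputPath:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// The trailing codec argument compresses the output of the custom format.
rdd.saveAsHadoopFile(outputPath, classOf[Any], classOf[Any],
  classOf[MultiOutputFormat], classOf[GzipCodec])
```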
