
How to save as object files to different directories in Spark?

I have an RDD like the one below:

Array(
(0, "xx"),
(1, "xx"),
(2, "xx"),
(1, "yy")
)

I want to save it to different directories by key. For example, to create 3 files in those directories:

0/part-00000 // xx
1/part-00000 // xx and yy
2/part-00000 // xx

Through saveAsHadoopFile and MultipleTextOutputFormat, I can do it in text format. However, this RDD contains huge, complex data, so saving it in a compressed format may be better, like what saveAsObjectFile does.

MultipleSequenceFileOutputFormat may help me achieve this, but how do I use it correctly?
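One possible shape for this, sketched under the assumption that the old Hadoop API (saveAsHadoopFile) is used and that keys and values have already been converted to Writables; KeyBasedSequenceOutput and outputPath are illustrative names, not part of any Spark or Hadoop API:

```scala
import org.apache.hadoop.io.{BytesWritable, IntWritable, NullWritable}
import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat

// Illustrative subclass: routes each record into a subdirectory named
// after its key, writing (NullWritable, BytesWritable) SequenceFile records.
class KeyBasedSequenceOutput extends MultipleSequenceFileOutputFormat[Any, Any] {

  // The key only selects the directory; it is not written to the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // E.g. key 1 with the default name "part-00000" becomes "1/part-00000".
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[IntWritable].get + "/" + name
}

// rdd: RDD[(IntWritable, BytesWritable)] -- values pre-serialized to bytes.
rdd.saveAsHadoopFile(outputPath, classOf[IntWritable], classOf[BytesWritable],
  classOf[KeyBasedSequenceOutput])
```

Since SequenceFile values must be Writable, complex objects need to be serialized to something like BytesWritable first, which is roughly what saveAsObjectFile does internally.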


EDIT:

I have tried this to do it in text format:

class MultiOutputFormat extends MultipleTextOutputFormat[Any, Any] {

  // Drop the key from the lines actually written out.
  override def generateActualKey(key: Any, value: Any): Any = {
    NullWritable.get()
  }

  // Prefix the default file name ("part-00000", ...) with the key as a directory.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
    key.asInstanceOf[Int] + "/" + super.generateFileNameForKeyValue(key, value, name)
  }
}

.saveAsHadoopFile(outputPath, classOf[Any], classOf[Any], classOf[MultiOutputFormat])
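To adapt this attempt from text to sequence files, complex values can first be Java-serialized into BytesWritable, which is roughly the representation saveAsObjectFile itself writes. A minimal sketch; toBytesWritable is a hypothetical helper, not an existing Spark function:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, IntWritable}

// Hypothetical helper: Java-serialize any value into a BytesWritable.
def toBytesWritable(value: AnyRef): BytesWritable = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(value)
  oos.close()
  new BytesWritable(bos.toByteArray)
}

// Prepare the RDD so that both key and value are Hadoop Writables,
// ready to be handed to a MultipleSequenceFileOutputFormat subclass.
val writableRdd = rdd.map { case (k, v) => (new IntWritable(k), toBytesWritable(v)) }
```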

What version of Spark are you using?

Have you tried something like

.repartition(3).saveAsTextFile("/path/to/output", classOf[GzipCodec])

or

sc.hadoopConfiguration.setClass(FileOutputFormat.COMPRESS_CODEC, classOf[GzipCodec], classOf[CompressionCodec])

?
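If the custom output format from the question is kept, saveAsHadoopFile also has an overload that takes a compression codec class directly, so per-key directories and compression can be combined. A sketch reusing the question's MultiOutputFormat and outputPath:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// The trailing codec argument compresses the output of the custom format.
rdd.saveAsHadoopFile(outputPath, classOf[Any], classOf[Any],
  classOf[MultiOutputFormat], classOf[GzipCodec])
```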
