Store TreeSet on Hadoop DistributedCache
I am trying to store a TreeSet in the Hadoop DistributedCache for use by a map-reduce job. So far I have the following for adding a file from HDFS to the DistributedCache:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/my/cache/path"), conf);
Job job = new Job(conf, "my job");
// Proceed with remainder of Hadoop map-reduce job set-up and running
How do I efficiently get this TreeSet (which I have already built in this class) into the file that I am adding to the DistributedCache? Should I use Java's native serialization to serialize it to the file?

Note that the TreeSet is built once, in the main class that starts the map-reduce jobs. It will never be modified, and I simply want every mapper to have read-only access to it without rebuilding it over and over.
Serializing the TreeSet seems to be the right approach. You do not need to build a HashMap in this case: just deserialize the TreeSet from the cached file and use its methods to search by key. I like this approach.
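The serialize/deserialize round trip can be sketched with plain Java serialization. This is a minimal local sketch (the file name and helper names are illustrative): in the actual job you would write the serialized file to HDFS, register it with DistributedCache.addCacheFile as shown above, and run the load step once in the mapper's setup() method.

```java
import java.io.*;
import java.util.TreeSet;

public class TreeSetCacheSketch {

    // Driver side: serialize the TreeSet once, before job submission.
    static void save(TreeSet<String> set, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(set);
        }
    }

    // Mapper side: deserialize once in setup(); the set is then read-only.
    @SuppressWarnings("unchecked")
    static TreeSet<String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (TreeSet<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TreeSet<String> set = new TreeSet<>();
        set.add("banana");
        set.add("apple");
        set.add("cherry");

        File file = File.createTempFile("treeset", ".ser");
        save(set, file);

        TreeSet<String> copy = load(file);
        System.out.println(copy.first());            // apple
        System.out.println(copy.contains("banana")); // true
        file.delete();
    }
}
```

TreeSet implements Serializable as long as its elements do, so no custom Writable is needed; each mapper JVM pays the deserialization cost once in setup() rather than per record.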