
Reading many files hadoop mapreduce distributed cache

I have a set of files, say 10 files, and a single large file which is the concatenation of all 10 files.

I added them to the distributed cache through the job conf.

When I read them in reduce, I observe the following:

  1. In the reduce method I read only the selected files that were added to the distributed cache. I expected this to be faster, since the file read in each reduce is smaller than the large file read in all the reduce methods. But it was slower.

  2. Also, when I split it into even smaller files and added them to the distributed cache, the problem got worse. The job itself started running only after a long while.

I am unable to find the reason. Please help.

I think your problem lies in reading the files in reduce(). You should read the files in configure() (using the old API) or setup() (using the new API). That way each reducer reads them just once, rather than once for every input group passed to the reducer (basically, on each call to the reduce method).

You can write something like the following. Using the NEW mapreduce API (org.apache.hadoop.mapreduce.*):

    public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

        ...
        Path file1;
        Path file2;
        ...

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Get the local paths of the cached files from the distributed cache.
            file1 = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];
            file2 = DistributedCache.getLocalCacheFiles(context.getConfiguration())[1];

            // Parse the files and keep their data in memory for use in reduce(),
            // probably in an ArrayList or a HashMap.
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            ...
        }
    }
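
The "parse the file" step in setup() might look something like the sketch below, which loads a tab-separated cache file into a HashMap for lookups in reduce(). The parseFile name, the separator, and the map layout are assumptions for illustration, not part of the original answer; adapt them to your file format.

    private Map<String, String> lookup = new HashMap<String, String>();

    // Hypothetical helper: load a tab-separated local cache file into the map.
    // Requires java.io.BufferedReader, java.io.FileReader, java.io.IOException
    // and java.util.HashMap/Map imports.
    private void parseFile(Path localFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(localFile.toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

Call it once per file from setup() (e.g. parseFile(file1); parseFile(file2);), so that reduce() only does in-memory lookups.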

Using the OLD mapred API (org.apache.hadoop.mapred.*):

    public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

        ...
        Path file1;
        Path file2;
        ...

        @Override
        public void configure(JobConf job) {
            // Get the local paths of the cached files from the distributed cache.
            // configure() cannot declare IOException, so handle it here.
            try {
                file1 = DistributedCache.getLocalCacheFiles(job)[0];
                file2 = DistributedCache.getLocalCacheFiles(job)[1];
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            ...

            // Parse the files and keep their data in memory for use in reduce(),
            // probably in an ArrayList or a HashMap.
        }

        @Override
        public synchronized void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
            ...
        }
    }
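
In both cases the files must have been registered in the driver before the job is submitted; a minimal sketch, where the HDFS paths are placeholders:

    // Driver side: add the files to the distributed cache before submitting the job.
    // The reducer code above indexes getLocalCacheFiles() by position, which
    // follows the order in which the files are added here.
    DistributedCache.addCacheFile(new Path("/user/me/cache/file1.txt").toUri(), conf);
    DistributedCache.addCacheFile(new Path("/user/me/cache/file2.txt").toUri(), conf);

Here conf is the job's Configuration (or the JobConf itself when using the old API).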
