
WordCount MapReduce is giving unexpected result

I am trying this Java code for word count in MapReduce, and after the reduce method completes I want to output only the word that occurs the most times.

For that I have created some class-level variables named myoutput, mykey and completeSum.

I write this data in the close method, but I get an unexpected result at the end.

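import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {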

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }

    }
}

static int completeSum = -1;
static OutputCollector<Text, IntWritable> myoutput;
static Text mykey = new Text();

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        if (completeSum < sum) {
            completeSum = sum;
            myoutput = output;
            mykey = key;
        }


    }

    @Override
    public void close() throws IOException {
        // TODO Auto-generated method stub
        super.close();
        myoutput.collect(mykey, new IntWritable(completeSum));
    }
}

public static void main(String[] args) throws Exception {

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);

}
}

Input file data:

one 
three three three
four four four four 
 six six six six six six six six six six six six six six six six six six 
five five five five five 
seven seven seven seven seven seven seven seven seven seven seven seven seven 

The result should be:

six 18

However, I am getting this result:

three 18

From the result I can see that the sum is correct but the key is not.

If someone can give a good reference on these map and reduce methods, that would be very helpful.

The problem you are observing is due to reference aliasing. The framework reuses the object referenced by key, filling it with new content on each invocation of reduce, so mykey, which references the same object, changes too. It ends up holding the last reduced key. This can be avoided by copying the object, as in:

mykey = new Text(key);
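
To see the aliasing in isolation, here is a minimal sketch that does not use Hadoop at all. MutableKey is a hypothetical stand-in for Text (a mutable holder that is refilled rather than reallocated), and maxKey is an illustrative helper, not part of any Hadoop API. The keys are fed in sorted order, on the assumption that this is how the reducer in the question receives them.

```java
import java.util.Arrays;
import java.util.List;

class AliasingDemo {

    // Hypothetical stand-in for Hadoop's Text: a mutable holder that the
    // caller refills with new content instead of allocating a new object.
    static final class MutableKey {
        private String value = "";

        MutableKey() {}

        // Copy constructor, analogous in spirit to new Text(key).
        MutableKey(MutableKey other) { this.value = other.value; }

        void set(String v) { this.value = v; }

        @Override
        public String toString() { return value; }
    }

    // Returns "key count" for the largest count. When copy is false the
    // remembered key is only an alias of the reused object.
    static String maxKey(List<String> words, List<Integer> counts, boolean copy) {
        MutableKey reused = new MutableKey(); // one object, reused for every call
        MutableKey best = new MutableKey();
        int bestCount = -1;
        for (int i = 0; i < words.size(); i++) {
            reused.set(words.get(i)); // the "framework" overwrites the key in place
            if (counts.get(i) > bestCount) {
                bestCount = counts.get(i);
                best = copy ? new MutableKey(reused) : reused;
            }
        }
        return best + " " + bestCount;
    }

    public static void main(String[] args) {
        // Words and counts from the question's input, in sorted key order.
        List<String> words = Arrays.asList("five", "four", "one", "seven", "six", "three");
        List<Integer> counts = Arrays.asList(5, 4, 1, 13, 18, 3);
        System.out.println(maxKey(words, counts, false)); // aliased: three 18
        System.out.println(maxKey(words, counts, true));  // copied:  six 18
    }
}
```

The aliased run reproduces the symptom from the question: the count is right, but the key is whatever was written into the shared object last.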

However, you should read the result only from the output file, because static variables are not shared between different nodes in a distributed cluster. This approach only sort of works in standalone mode, which defeats the purpose of MapReduce.
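
One way to avoid the static variables is to keep the running maximum in instance fields of each reduce task and let every task emit its own candidate, which is then compared in a final step. The following is only a sketch of that idea in plain Java, not Hadoop code: MaxReducer, overallMax and run are hypothetical names, and the split of keys across two tasks is made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class LocalMaxSketch {

    // Hypothetical stand-in for one reduce task. State lives in instance
    // fields, so every task (possibly on another node) has its own copy.
    static final class MaxReducer {
        private String bestKey;
        private int bestCount = -1;

        void reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            if (sum > bestCount) {
                bestCount = sum;
                bestKey = key; // String is immutable, so no defensive copy is needed
            }
        }

        // Plays the role of close(): emit this task's single candidate.
        Map.Entry<String, Integer> close() {
            return new SimpleEntry<>(bestKey, bestCount);
        }
    }

    // Final step: pick the largest of the per-task candidates.
    static Map.Entry<String, Integer> overallMax(List<Map.Entry<String, Integer>> partials) {
        Map.Entry<String, Integer> best = partials.get(0);
        for (Map.Entry<String, Integer> e : partials) {
            if (e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    // Simulates two independent reduce tasks over a made-up key partition.
    static Map.Entry<String, Integer> run() {
        MaxReducer r1 = new MaxReducer();
        r1.reduce("five", Collections.nCopies(5, 1));
        r1.reduce("six", Collections.nCopies(18, 1));

        MaxReducer r2 = new MaxReducer();
        r2.reduce("seven", Collections.nCopies(13, 1));
        r2.reduce("three", Collections.nCopies(3, 1));

        return overallMax(Arrays.asList(r1.close(), r2.close()));
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> result = run();
        System.out.println(result.getKey() + " " + result.getValue());
    }
}
```

With a single reduce task the final comparison is trivial, but keeping the state per task means the logic does not silently depend on everything running in one JVM.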

Finally, using global variables, even in standalone mode, will most likely lead to race conditions if local tasks run in parallel (see MAPREDUCE-1367 and MAPREDUCE-434).
