
Hadoop MapReduce (Java) - error counting all unique words in text with Reducer as Combiner

I adapted the standard Hadoop word count example to count all the unique words across a series of input text files, using a user-defined counter with an enum defined in the driver class like so:

public enum Operations { UNIQUE_WC }

My code in the Reducer is as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
        // Increment the counter once per reduce call, i.e. once per unique key.
        context.getCounter(WordCountJobControl.Operations.UNIQUE_WC).increment(1);
    }
}

When the Reducer class is also set as the Combiner, this produces odd behaviour. Instead of ending up equal to Reduce Input Groups / Reduce Output Records, the counter receives the sum of Reduce Input Groups and Reduce Input Records, i.e. unique words plus total words, or keys plus values.
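For context, here is a minimal sketch of a driver that reproduces this setup. WordCountMapper and the details of main are assumptions on my part (the post only shows the enum and the reducer); the setCombinerClass line is what triggers the behaviour:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJobControl {

    public enum Operations { UNIQUE_WC }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "unique word count");
        job.setJarByClass(WordCountJobControl.class);
        job.setMapperClass(WordCountMapper.class);    // assumed: the standard word count mapper
        job.setCombinerClass(WordCountReducer.class); // reusing the reducer as combiner causes the issue
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean ok = job.waitForCompletion(true);
        // Read the user-defined counter back once the job has finished.
        long uniqueWords = job.getCounters()
                .findCounter(Operations.UNIQUE_WC).getValue();
        System.out.println("Unique words: " + uniqueWords);
        System.exit(ok ? 0 : 1);
    }
}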

Can anyone help me understand the logic behind why this is happening? From what I understand (presumably wrongly), doing this should, if anything, have reduced the count.

Following is an example:

Suppose we have two files file1 & file2.

File1 contains: word1 word2 word3 word1

File2 contains: word1 word2

After mapping we get the following output from the two map functions (one for each file):

For file1:
word1,1
word2,1
word3,1
word1,1

For file2:
word1,1
word2,1

These are then combined using a combiner that is the same as the reducer function. The key-value pairs become:

For file1:
word1,2
word2,1
word3,1

File2 remains the same, since each of its keys appears only once. The reducer is then applied per unique key, so there will be 3 reduce calls (one for each word) to get the total counts. The issue you are facing is that the counter is incremented in both the combiner and the reducer stage: the combiner's reduce function is called once per key in each map's output (3 times for file1, 2 times for file2), and the real reducer then increments the counter again once per key (3 more times). The whole point is that the combiner merges identical keys within a single map's output, not across all the maps, so the same word is still counted once per file and then once more in the reducer. The counter should not be incremented in the combiner stage (a sketch of a counter-free combiner follows the walkthrough below).

What you are doing is:

Map stage: Counter = 0.
Combine stage: for file1 the combiner runs once per key (word1, word2, word3), so Counter = 3; for file2 it adds 2 more (word1, word2). After the combine stage the value is 5, which is exactly the Reduce Input Records.
Reduce stage: the counter gets incremented once for each key, adding 3 (the Reduce Input Groups). So the counter becomes 8 instead of the expected 3.
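One way to apply that advice, as a rough sketch (the class name SumCombiner is my own, not from the original post): keep the counter increment in WordCountReducer, and use a separate summing-only combiner that never touches the counter:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pre-sums the counts for each key but never increments UNIQUE_WC,
// so the counter is only ever touched in the real reduce stage.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
        // Deliberately no counter increment here.
    }
}

In the driver you would then call job.setCombinerClass(SumCombiner.class) while keeping job.setReducerClass(WordCountReducer.class); in the two-file example the counter then ends at 3, the number of unique words.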

Hope that clears things up.
