hadoop Inverted index count

Question

I have two files as input:

fileA.txt:

learn hadoop
learn java

fileB.txt:

hadoop java
eclipse eclipse

Desired Output:

learn   fileA.txt:2

hadoop  fileA.txt:1 , fileB.txt:1

java    fileA.txt:1 , fileB.txt:1

eclipse fileB.txt:2

My reduce method:

public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {

            Set<Text> outputValues = new HashSet<Text>();
            while (values.hasNext()) {
                Text value = new Text(values.next());
                // delete duplicates
                outputValues.add(value);
            }
            boolean isfirst = true;
            StringBuilder toReturn = new StringBuilder();
            Iterator<Text> outputIter = outputValues.iterator();
            while (outputIter.hasNext()) {
                if (!isfirst) {
                    toReturn.append("/");
                }
                isfirst = false;
                toReturn.append(outputIter.next().toString());
            }
            output.collect(key, new Text(toReturn.toString()));
        }

I need help with the counter(count the words by file)

I managed to print:

learn   fileA.txt

hadoop  fileA.txt / fileB.txt

java    fileA.txt / fileB.txt

eclipse fileB.txt

but cannot print the count per file

Any help will be much appreciated

Answer 1

as i understand this should print what you want:

@Override
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    Map<String, Integer> fileToCnt = new HashMap<String, Integer>();
    while(values.hasNext()) {
        String file = values.next().toString();
        Integer current = fileToCnt.get(file);
        if (current == null) {
            current = 0;
        }
        fileToCnt.put(file, current + 1);
    }
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
        if (!isfirst) {
            toReturn.append(", ");
        }
        isfirst = false;
        toReturn.append(entry.getKey()).append(":").append(entry.getValue());
    }
    output.collect(key, new Text(toReturn.toString()));
}

hadoop Inverted index count

Question

1 answers

solution1
1 ACCPTED 2014-05-01 17:40:17

hadoop Inverted index count

Question

1 answers

solution1 1 ACCPTED 2014-05-01 17:40:17

solution1
1 ACCPTED 2014-05-01 17:40:17