Hadoop inverted index count

I have two files as input:

fileA.txt:

learn hadoop
learn java

fileB.txt:

hadoop java
eclipse eclipse

Desired Output:

learn   fileA.txt:2

hadoop  fileA.txt:1 , fileB.txt:1

java    fileA.txt:1 , fileB.txt:1

eclipse fileB.txt:2

My reduce method:

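public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Collect the distinct file names for this word (the set drops duplicates).
    Set<Text> outputValues = new HashSet<Text>();
    while (values.hasNext()) {
        // Copy the value, because Hadoop reuses the same Text instance.
        Text value = new Text(values.next());
        outputValues.add(value);
    }

    // Join the file names with " / ".
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!isfirst) {
            toReturn.append(" / ");
        }
        isfirst = false;
        toReturn.append(outputIter.next().toString());
    }
    output.collect(key, new Text(toReturn.toString()));
}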

I need help with the counter (counting the occurrences of each word per file).

I managed to print:

learn   fileA.txt

hadoop  fileA.txt / fileB.txt

java    fileA.txt / fileB.txt

eclipse fileB.txt

but I cannot print the count per file.

Any help will be much appreciated.

As I understand it, this should print what you want:

public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Count how many times each file name was emitted for this word.
    Map<String, Integer> fileToCnt = new HashMap<String, Integer>();
    while (values.hasNext()) {
        String file = values.next().toString();
        Integer current = fileToCnt.get(file);
        if (current == null) {
            current = 0;
        }
        fileToCnt.put(file, current + 1);
    }
    // Join the per-file counts as "file:count, file:count".
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
        if (!isfirst) {
            toReturn.append(", ");
        }
        isfirst = false;
        toReturn.append(entry.getKey()).append(":").append(entry.getValue());
    }
    output.collect(key, new Text(toReturn.toString()));
}
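
If you want to check the counting logic without a cluster, here is a minimal plain-Java sketch of the same idea. It assumes your mapper emits one (word, file name) pair per token, which is what the reduce method above relies on. The class name InvertedIndexCountSketch and the helper mapAndShuffle are made up for illustration and are not part of Hadoop. It also swaps HashMap for TreeMap: HashMap does not guarantee iteration order, so the files could come out in any order, while TreeMap keeps them sorted by name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone simulation; no Hadoop classes are used.
class InvertedIndexCountSketch {

    // Stand-in for the map and shuffle phases (an assumption about the mapper):
    // emit one (word, fileName) pair per token and group the pairs by word.
    static Map<String, List<String>> mapAndShuffle(Map<String, List<String>> linesByFile) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : linesByFile.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // skip blank lines
                    }
                    grouped.computeIfAbsent(word, w -> new ArrayList<>()).add(file.getKey());
                }
            }
        }
        return grouped;
    }

    // Same counting idea as the reducer: tally how often each file name
    // appears among the values of one word, then join as "file:count".
    static String reduce(Iterable<String> fileNames) {
        Map<String, Integer> fileToCnt = new TreeMap<>(); // sorted by file name
        for (String file : fileNames) {
            fileToCnt.merge(file, 1, Integer::sum);
        }
        StringBuilder toReturn = new StringBuilder();
        for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
            if (toReturn.length() > 0) {
                toReturn.append(", ");
            }
            toReturn.append(entry.getKey()).append(":").append(entry.getValue());
        }
        return toReturn.toString();
    }

    public static void main(String[] args) {
        // The two input files from the question.
        Map<String, List<String>> input = new TreeMap<>();
        input.put("fileA.txt", Arrays.asList("learn hadoop", "learn java"));
        input.put("fileB.txt", Arrays.asList("hadoop java", "eclipse eclipse"));

        for (Map.Entry<String, List<String>> e : mapAndShuffle(input).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In this sketch the words are printed in alphabetical order because the grouped map is also a TreeMap. In a real job the order of the keys comes from Hadoop's sort phase instead.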
