MapReduce (Python) - How to sort reducer output for Top-N list?

I'm pretty new to MapReduce and am currently working through the Udacity course on Hadoop MapReduce.

I have a mapper that parses forum nodes and emits the tags associated with each node. My objective is to produce a sorted list of the top 10 tags.

An example of the mapper output:

video   1
cs101   1
meta    1
bug     1
issues  1
nationalities   1
cs101   1
welcome 1
cs101   1
cs212   1
cs262   1
cs253   1
discussion      1
meta    1

It is pretty easy to add them all up in the reducer:

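#!/usr/bin/python

import sys

total = 0
oldKey = None

# Input arrives sorted by key, so all lines for one tag are adjacent.
for line in sys.stdin:
    data_mapped = line.strip().split("\t")

    if len(data_mapped) != 2:
        # Malformed line: echo it between markers and skip it.
        print "====================="
        print line.strip()
        print "====================="
        continue

    key, value = data_mapped

    # The key changed: emit the finished tag's total and reset the counter.
    if oldKey is not None and oldKey != key:
        print total, "\t", oldKey
        total = 0

    oldKey = key
    total += float(value)

# Emit the total for the last tag.
if oldKey is not None:
    print total, "\t", oldKey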

Output:

1.0     application
1.0     board
1.0     browsers
1.0     bug
8.0     cs101
1.0     cs212
5.0     cs253
1.0     cs262
1.0     deadlines
1.0     digital
5.0     discussion
1.0     google-appengine
2.0     homework
1.0     html
1.0     hungarian
1.0     hw2-1
3.0     issues
2.0     jobs
2.0     lessons

I know that the mapper output reaches the reducer sorted by key, so I just test whether the key is still the same tag. When it changes, I output the number of times the previous tag appeared.
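The adjacent-key idea described above can also be sketched with itertools.groupby, which groups consecutive items that share a key. This is only an illustration, not the course's reference solution: it is written for Python 3, the helper name sum_by_key is hypothetical, and it assumes the input is already sorted by tag and that every count is an integer.

```python
from itertools import groupby


def sum_by_key(lines):
    """Yield (tag, total) for each run of adjacent lines sharing a tag.

    Assumes "tag<TAB>count" lines that are already sorted by tag,
    as they are when they reach a Hadoop streaming reducer.
    """
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    # Silently skip malformed lines that do not have exactly two fields.
    valid = (p for p in pairs if len(p) == 2)
    for tag, group in groupby(valid, key=lambda p: p[0]):
        yield tag, sum(int(count) for _, count in group)
```

In a reducer script this would be driven by something like `for tag, total in sum_by_key(sys.stdin): print("%s\t%d" % (tag, total))`. Only one group is held in memory at a time.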

However, the problem is: how do I sort this list by count?

I could sort it in Python if I stored all the information in a list or a dictionary. However, that seems like a bad design decision, because with a lot of different tags you will run out of memory.

I believe you can use the collections.Counter class here:

Example (modified from your code):

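#!/usr/bin/python

import sys
import collections

counter = collections.Counter()

# Accumulate the count for each tag from "tag<TAB>count" lines.
for line in sys.stdin:
    # maxsplit=1 yields at most two fields, matching the two names unpacked.
    k, v = line.strip().split("\t", 1)

    counter[k] += int(v)

# Print the 10 most common tags as a list of (tag, count) pairs.
print counter.most_common(10)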

The collections.Counter class implements this exact use case, along with many other common use cases around counting things and collecting statistics.
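Note that a Counter still holds one entry per distinct tag, which is the memory concern raised in the question. One possible way around it, sketched here under assumptions, is to run a second pass over the summing reducer's output and keep only the N largest entries in a min-heap. The helper name top_n is hypothetical, the code is written for Python 3, and it assumes each tag appears on exactly one "count<TAB>tag" line (as in the reducer output shown above).

```python
import heapq


def top_n(lines, n=10):
    """Return the n largest (count, tag) pairs, largest first.

    Assumes "count<TAB>tag" lines with one line per tag. Memory use
    is bounded by n, not by the number of distinct tags.
    """
    heap = []  # min-heap holding at most n (count, tag) pairs
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        item = (float(parts[0]), parts[1].strip())
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new item beats the smallest kept item: swap it in.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

With several reducers, each one could emit its own top N this way, and a final single step would merge those short lists.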

8.3.1. Counter objects

A counter tool is provided to support convenient and rapid tallies. For example:
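A short illustration of such a tally, written for Python 3 with a made-up word list:

```python
from collections import Counter

# Tally occurrences of words in a list.
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1

# most_common(2) returns the two highest counts as (word, count) pairs.
print(cnt.most_common(2))  # [('blue', 3), ('red', 2)]
```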
