
How to handle MapReduce keys in a Python Hadoop streaming job

My map function emits two possible keys: NY and Other. So each line of map output is either `NY 1` or `Other 1`; only these two cases occur.

My map function:

    #!/usr/bin/env python
    import sys
    import csv

    reader = csv.reader(sys.stdin, delimiter=',')
    for entry in reader:
        if len(entry) == 22:
            registration_state = entry[16]
            print('{0}\t{1}'.format(registration_state, 1))

Now I need a reducer to process the map output. My reduce function:

    #!/usr/bin/env python
    import sys

    ny = 0
    other = 0

    # Input comes from STDIN (the stream of mapper output lines)
    for line in sys.stdin:
        # Remove leading and trailing whitespace
        line = line.strip()

        # Split into key and value
        key, value = line.split('\t', 1)
        value = int(value)

        # Tally the two possible keys
        if key == 'NY':
            ny += 1
        else:
            other += 1

    # Output the totals after all input is consumed
    print('{0}\t{1}'.format('NY', ny))
    print('{0}\t{1}'.format('Other', other))

From these, MapReduce produces two output files, each containing both NY and Other counts, e.g. one file has `NY 1248, Other 4677` and the other has `NY 0, Other 1000`. This happens because two reducers split the map output, so each produces a partial result; merging the two files gives the final totals.

However, I would like to change my reduce or map function so that each reducer processes only one key, i.e. one reducer handles only the NY key and the other handles only Other. I expect results like one file containing:

NY 1258, Other 0; and the other: NY 0, Other 5677.

How can I adjust my functions to achieve the results I expect?
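For routing keys to specific reducers, Hadoop Streaming's key-field-based partitioner is the usual tool. Below is a sketch of the job invocation; the jar location and the input/output paths are placeholders, not values from the question. Note that this partitioner still hashes the key, so with only two keys it is possible (though unlikely to matter here) that both hash to the same reducer; a guaranteed one-key-per-reducer split would need a custom partitioner.

```shell
# Sketch of a Hadoop Streaming invocation (jar and HDFS paths are placeholders).
# Two reducers, partitioned on the first tab-separated field of the map output.
hadoop jar hadoop-streaming.jar \
    -D mapreduce.job.reduces=2 \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /path/to/input \
    -output /path/to/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```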

You probably need to use Python iterators and generators. An excellent example is given at this link. I have tried rewriting your code in the same style (not tested):

Mapper:

    #!/usr/bin/env python
    """A more advanced Mapper, using Python iterators and generators."""

    import csv
    import sys

    def main(separator='\t'):
        reader = csv.reader(sys.stdin, delimiter=',')
        for entry in reader:
            if len(entry) == 22:
                registration_state = entry[16]
                print('%s%s%d' % (registration_state, separator, 1))

    if __name__ == "__main__":
        main()

Reducer:

    #!/usr/bin/env python
    """A more advanced Reducer, using Python iterators and generators."""

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # Hadoop streaming sorts map output by key, so groupby can batch
        # consecutive lines that share the same key.
        data = read_mapper_output(sys.stdin, separator=separator)
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print('%s%s%d' % (current_word, separator, total_count))
            except ValueError:
                # Count was not a number, so silently discard this item
                pass

    if __name__ == "__main__":
        main()
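The `groupby` approach only works because the input is sorted by key, which Hadoop streaming guarantees between the map and reduce phases (`groupby` groups *consecutive* equal keys only). A minimal local sketch of that reduction, using made-up mapper lines:

```python
from itertools import groupby
from operator import itemgetter

# Simulated, already-sorted mapper output (key\tvalue lines).
lines = ['NY\t1', 'NY\t1', 'NY\t1', 'Other\t1', 'Other\t1']

# Split each line into (key, count), then sum counts per consecutive key.
pairs = (line.split('\t', 1) for line in lines)
totals = {key: sum(int(count) for _, count in group)
          for key, group in groupby(pairs, itemgetter(0))}

print(totals)  # {'NY': 3, 'Other': 2}
```

If the lines were not sorted (e.g. `NY, Other, NY`), `groupby` would emit the same key twice, which is why sorting matters when testing these scripts locally.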
