How to process MapReduce output by identifying keys in Python on Hadoop

Question

My map function produces two kinds of keys: NY and Other. So the output for each record is either NY 1 or Other 1. Only these two cases occur.

My map function:

    #!/usr/bin/env python
    import sys
    import csv
    import string

    reader = csv.reader(sys.stdin, delimiter=',')
    for entry in reader:
        if len(entry) == 22:
            registration_state=entry[16]
            print('{0}\t{1}'.format(registration_state,int(1)))

Now I need a reducer to process the map output. My reducer:

    #!/usr/bin/env python
    import sys

    ny = 0
    other = 0
    # Input comes from STDIN (the mapper output streamed to this program)
    for line in sys.stdin:
        # Remove leading and trailing whitespace
        line = line.strip()

        # Split the line into key and value
        key, values = line.split('\t', 1)
        values = int(values)

        # Count the record under NY or under Other
        if key == 'NY':
            ny = ny + 1
        else:
            other = other + 1

    # Output both totals once all input has been read
    print('{0}\t{1}'.format('NY', ny))
    print('{0}\t{1}'.format('Other', other))

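With these, MapReduce produces two output files, each containing both an NY line and an Other line. That is, one file contains NY 1248, Other 4677, and the other contains NY 0, Other 1000. This happens because two reducers split the map output between them, so two results are generated, and the final answer only appears after merging the two files.

However, I want to change my reduce or map function so that each reduce process works on only one key: one reducer receives only NY as its key, and the other works only on Other. I expect a result like the following: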

NY 1258, Others 0; Another: NY 0, Others 5677.

How can I adjust the functions to get the expected result?

Answer 1

You probably need to use Python iterators and generators. This link is a good example. I tried rewriting your code along the same lines (untested).

Mapper:

    #!/usr/bin/env python
    """A more advanced Mapper, using Python iterators and generators."""

    import csv
    import sys


    def main(separator='\t'):
        # Parse stdin as CSV and emit "<state><separator>1" for each 22-column row.
        reader = csv.reader(sys.stdin, delimiter=',')
        for entry in reader:
            if len(entry) == 22:
                registration_state = entry[16]
                print('%s%s%d' % (registration_state, separator, 1))


    if __name__ == "__main__":
        main()
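Note that the mapper emits the raw registration state as the key, so the reducers may see many distinct keys rather than just NY and Other. A minimal sketch, assuming the goal is exactly two keys, of a mapper that collapses every non-NY state into Other before emitting it. The names `normalize_state` and `map_rows` and the sample rows are made up for illustration and are not part of the original code:

```python
import csv
import io


def normalize_state(state):
    """Collapse a registration state into one of two keys: 'NY' or 'Other'."""
    return 'NY' if state.strip().upper() == 'NY' else 'Other'


def map_rows(stream, separator='\t'):
    """Yield 'key<separator>1' for every well-formed 22-column CSV row."""
    for entry in csv.reader(stream, delimiter=','):
        if len(entry) == 22:
            yield '{0}{1}{2}'.format(normalize_state(entry[16]), separator, 1)


# Local check with two made-up rows; column 16 holds the state.
sample_rows = [['x'] * 16 + [state] + ['x'] * 5 for state in ('NY', 'NJ')]
sample = io.StringIO('\n'.join(','.join(row) for row in sample_rows))
mapped = list(map_rows(sample))
print(mapped)
```

In a streaming job the same generator would be driven by `sys.stdin`, printing each yielded line.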

Reducer:

    #!/usr/bin/env python
    """A more advanced Reducer, using Python iterators and generators."""

    from itertools import groupby
    from operator import itemgetter
    import sys


    def read_mapper_output(file, separator='\t'):
        # Lazily yield [key, count] pairs from the mapper's output lines.
        for line in file:
            yield line.rstrip().split(separator, 1)


    def main(separator='\t'):
        # Hadoop sorts mapper output by key, so equal keys arrive adjacent
        # and groupby can sum each key's counts in a single pass.
        data = read_mapper_output(sys.stdin, separator=separator)
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for _, count in group)
                print("%s%s%d" % (current_word, separator, total_count))
            except ValueError:
                # count was not a number, so silently discard this item
                pass


    if __name__ == "__main__":
        main()
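Since the code above is untested, it may help to try the grouping logic locally before submitting a job. The sketch below is only an approximation: `sorted()` stands in for Hadoop's shuffle/sort phase, the input lines are made up, and `reduce_pairs` is a hypothetical helper name rather than anything from the original post:

```python
from itertools import groupby
from operator import itemgetter


def reduce_pairs(lines, separator='\t'):
    """Sum the counts of adjacent lines that share the same key."""
    pairs = (line.rstrip().split(separator, 1) for line in lines)
    for key, group in groupby(pairs, itemgetter(0)):
        yield key, sum(int(count) for _, count in group)


# Made-up mapper output; sorting makes equal keys adjacent,
# which is what groupby needs to merge them into one group.
mapper_output = ['Other\t1', 'NY\t1', 'Other\t1', 'NY\t1', 'Other\t1']
totals = list(reduce_pairs(sorted(mapper_output)))
print(totals)
```

Because this reducer prints a line only for keys it actually received, an output file no longer contains a zero count for a key that went to the other reducer. Hadoop routes all records with the same key to the same reducer, so each key's total lands in exactly one file. Whether NY and Other end up in different files depends on how the partitioner assigns those two keys.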
