Merging MapReduce job output files

Question

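I have written a Mapper and a Reducer in Python and run them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming.

The final result folder contains the output in three separate files: part-00000, part-00001 and part-00002. But I need the output as a single file. Is there a way to do that?

Here is my Mapper code: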

#!/usr/bin/env python

import sys

# Emit "<word>\t1" for every whitespace-separated word read from stdin.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

Here is my Reducer code:

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
max_count=0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number.
        continue

    # Hadoop sorts the mapper output by key, so equal words arrive consecutively.
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            if current_word[0] != '@':
                print('%s\t%d' % (current_word, current_count))
                if count > max_count:
                    max_count = count
        current_count = count
        current_word = word

# Emit the last word.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

I need this output as a single file.

Answer 1

A very simple way (assuming a Linux/UNIX system):

$ cat part-00000 part-00001 part-00002 > output
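The same concatenation can be done from Python, which may be convenient when the merge is part of a larger script. This is only a minimal sketch that assumes the part files have already been copied to the local file system; `merge_parts` is a hypothetical helper name, not something from Hadoop or EMR.

```python
import glob
import shutil


def merge_parts(pattern, merged_path):
    """Concatenate every file matching `pattern` into `merged_path`, in sorted name order."""
    # The file list is collected before the merged file is opened.
    part_paths = sorted(glob.glob(pattern))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Copy in chunks so large part files need not fit in memory.
                shutil.copyfileobj(part, merged)
    return part_paths
```

A call such as `merge_parts("part-*", "output")` mirrors the `cat` command above. The merged file name should not itself match the pattern.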

Answer 2

For small data sets or light processing, use a single reducer, or use the getmerge option on the job's output files.
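For the single-reducer route, the number of reduce tasks can be set when the streaming job is submitted. The sketch below only builds the argument list and assumes Hadoop Streaming's `-numReduceTasks` option; the function name `streaming_args` and all script names and paths passed to it are hypothetical placeholders.

```python
def streaming_args(mapper, reducer, input_path, output_path, num_reducers=1):
    """Build a list of Hadoop Streaming arguments for one job step."""
    return [
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
        # With one reduce task, every key goes to the same reducer,
        # so the job should write a single part file.
        "-numReduceTasks", str(num_reducers),
    ]
```

Because all map output then passes through one reducer, this is suitable only for small jobs, as the answer says.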

Answer 3

My solution to the above problem was to run the following hdfs command:

hadoop fs -getmerge /hdfs/path local_file

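/hdfs/path is the path that contains all the parts (part-*****) of the job output. The -getmerge option of hadoop fs merges all of the job output into a single file on the local file system.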
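If the merge has to be triggered from a Python script, the command above can be wrapped with `subprocess`. This is a sketch that assumes the `hadoop` executable is on the PATH of the machine running it; `getmerge_command` and `run_getmerge` are hypothetical helper names.

```python
import subprocess


def getmerge_command(hdfs_dir, local_file):
    """Return the argument list for `hadoop fs -getmerge <hdfs_dir> <local_file>`."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]


def run_getmerge(hdfs_dir, local_file):
    """Run getmerge; raises CalledProcessError if the hadoop command fails."""
    subprocess.check_call(getmerge_command(hdfs_dir, local_file))
```

Calling `run_getmerge("/hdfs/path", "local_file")` runs the same command as shown in the answer.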

Answer 4

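I ran into the same problem recently. A combiner should really be the thing that does this task, but somehow I could not get it to work. What I did instead:

Step 1: mapper1.py reducer1.py
Input: s3://../data/
Output: s3://..../small_output/
Step 2: mapper2.py reducer2.py
Input: s3://../data/
Output: s3://..../output2/
Step 3: mapper3.py reducer3.py
Input: s3://../output2/
Output: s3://..../final_output/

I assume that we need the output of step 1 as a single file in step 3.

At the top of mapper2.py, there is this code: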

import os

# The flag file marks that the download and merge were already started here.
if not os.path.isfile('/tmp/s3_sync_flag'):
    os.system('touch /tmp/s3_sync_flag')
    # [download files to /tmp/output/]
    os.system('cat /tmp/output/part* > /tmp/output/all')

The if block checks for the case where multiple mappers are executed, so that the download and merge run only once.
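One possible weakness of the guard above is that the `isfile` check and the `touch` are separate steps, so two mappers starting at the same moment on one node could both pass the check. The following is a hedged sketch of an alternative that creates the flag file atomically, assuming a POSIX-style local file system; `claim_flag` is a hypothetical helper and not part of the original answer.

```python
import errno
import os


def claim_flag(flag_path):
    """Return True only for the first caller that manages to create flag_path."""
    try:
        # O_CREAT | O_EXCL makes the call fail if the file already exists,
        # so checking and creating happen as one atomic step.
        fd = os.open(flag_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as err:
        if err.errno == errno.EEXIST:
            return False
        raise
    os.close(fd)
    return True
```

A mapper would then run the download and `cat` only when `claim_flag('/tmp/s3_sync_flag')` returns True. This decides which mapper does the work; it does not make the other mappers wait until the merged file is complete.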

4 answers

Answer 1  1  2013-12-14 09:08:37
Answer 2  0  2013-12-14 10:20:45
Answer 3  0
Answer 4  0  2015-04-01 16:46:02