Opening a file on HDFS from a Hadoop MapReduce job

Question

Normally, I can open a file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

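This opens the two text files in the WordLists folder and stores the lines of each file as a set in the dictionary, under the key 'positive' or 'negative'.

However, I don't think this will work when I run the script as a MapReduce job in Hadoop. I am running my program as follows: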

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I tried changing the code to this:

with open('/mapreduce/WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

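where mapreduce is a folder on HDFS and WordLists is a subfolder containing the negative words. But my program cannot find the file. Is what I am doing possible, and if so, what is the correct way to load a file that lives on HDFS?

Edit

I have now tried: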

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This seems to do something, but now I get this output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

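and then the job fails. So it is still not right. Any ideas?

Edit 2:

After re-reading the API documentation, I noticed that I can use the -files option on the command line to specify files. The documentation states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.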

-files hdfs://host:fs_port/user/testfile.txt

So I ran:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

As I understand the documentation, this creates symlinks, so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

However, this still doesn't work. Any help would be greatly appreciated, as I am stuck until I solve this.

Edit 3:

I can use this option:

-file ~/Twitter/SentimentWordLists/positive_words.txt

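along with the rest of my command to run the Hadoop job. This picks up the file from my local system rather than from HDFS. It doesn't throw any errors, so the file is being accepted somewhere. However, I don't know how to access it.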

Answer 1

The solution, after a lot of comments :)

To read a data file in Python: send the data file with -file and add the following to your script:

import sys

Sometimes you also need to add this after the import:

sys.path.append('.')

(Related to @DrDee's comment on "Hadoop Streaming - Unable to find file error")
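
To make the approach above concrete, here is a minimal sketch of a mapper that reads the word lists once they have been shipped with -file. It assumes the shipped files show up in the task's current working directory under their base names (positive_words.txt and negative_words.txt), so they are opened by bare filename instead of by their original path. The helper names load_word_lists, score_line, and run_mapper are made up for this illustration, and the tab-separated output format is just one possible choice.

```python
import os
import sys


def load_word_lists(directory='.'):
    """Load both word lists from `directory` into a dict of sets.

    Assumption: files shipped with -file are available in the task's
    current working directory under their base names.
    """
    word_lists = {}
    for label in ('positive', 'negative'):
        path = os.path.join(directory, label + '_words.txt')
        with open(path, 'r') as f:
            # Skip blank lines so the empty string never counts as a word.
            word_lists[label] = {line.strip() for line in f if line.strip()}
    return word_lists


def score_line(line, word_lists):
    """Return (positive_count, negative_count) for one line of input."""
    words = line.lower().split()
    positive = sum(1 for w in words if w in word_lists['positive'])
    negative = sum(1 for w in words if w in word_lists['negative'])
    return positive, negative


def run_mapper(stdin, stdout, directory='.'):
    """Read lines from stdin and write 'line<TAB>pos<TAB>neg' to stdout."""
    word_lists = load_word_lists(directory)
    for line in stdin:
        line = line.rstrip('\n')
        positive, negative = score_line(line, word_lists)
        stdout.write('%s\t%d\t%d\n' % (line, positive, negative))
```

To use this as the streaming mapper, call run_mapper(sys.stdin, sys.stdout) at the bottom of the script and pass each word list to the job with its own -file option, as in Edit 3 of the question.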

Answer 2

When dealing with HDFS programmatically, you should look at FileSystem, FileStatus, and Path. These are Hadoop API classes that let you access HDFS from within your program.
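
The classes named above belong to Hadoop's Java API, so a Python streaming script cannot call them directly. One possible workaround, sketched below, is to shell out to the `hadoop fs -cat` command and parse its output. This is not the method the answer describes; it assumes the hadoop executable is on the PATH of the machine running the script, and the helper names hdfs_cat_command, parse_word_lines, and read_hdfs_word_set are invented for this example.

```python
import subprocess


def hdfs_cat_command(hdfs_path, hadoop_bin='hadoop'):
    """Build the argument list for `hadoop fs -cat <hdfs_path>`."""
    return [hadoop_bin, 'fs', '-cat', hdfs_path]


def parse_word_lines(raw_bytes):
    """Turn the raw bytes of a word-list file into a set of words."""
    text = raw_bytes.decode('utf-8')
    return {line.strip() for line in text.splitlines() if line.strip()}


def read_hdfs_word_set(hdfs_path, hadoop_bin='hadoop'):
    """Read a word list stored on HDFS by shelling out to the hadoop CLI.

    Assumption: `hadoop_bin` resolves to a working hadoop executable.
    Raises subprocess.CalledProcessError if the command fails.
    """
    raw = subprocess.check_output(hdfs_cat_command(hdfs_path, hadoop_bin))
    return parse_word_lines(raw)
```

For example, read_hdfs_word_set('/mapreduce/WordLists/negative_words.txt') would return the negative words as a set, provided that path exists on HDFS. Starting a separate process in every map task is comparatively slow, so shipping small files with -file or -files is likely the simpler choice here.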
