
Opening files on HDFS from a Hadoop MapReduce job

Usually, I can open a file with something like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

This opens the two relevant text files in the WordLists folder and adds each line to the dictionary, as a set of either positive or negative words.
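For example, once the sets are built, a sentiment check is just a membership test (the word below is only illustrative):

word = 'great'
if word in aDict['positive']:
    print('%s is a positive word' % word)
elif word in aDict['negative']:
    print('%s is a negative word' % word)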

However, when I want to run a MapReduce job within Hadoop, I don't think this works. I am running my program like so:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I have tried to change the code to this:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

where mapreduce is a folder on HDFS, with WordLists a subfolder containing the negative words. But my program doesn't find the file. Is what I'm doing possible, and if so, what is the correct way to load files from HDFS?
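For context, a plain open() only reads the task's local filesystem, so an HDFS path like the one above is treated as a local path and not found. One workaround, sketched here on the assumption that the hadoop binary is on the task's PATH, is to pipe the file through hadoop fs -cat; the -files/-file approaches in the edits below are the more usual route:

import subprocess

def hdfs_words(path):
    # Read an HDFS file by piping it through the Hadoop CLI.
    # Assumes the 'hadoop' binary is on the PATH inside the task.
    data = subprocess.check_output(['hadoop', 'fs', '-cat', path])
    return {line.strip() for line in data.splitlines()}

negative = hdfs_words('/mapreduce/WordLists/negative_words.txt')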

Edit

I've now tried:

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get this sort of output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

Then the job fails. So it's still not right. Any ideas?

Edit 2:

Having re-read the API documentation, I notice I can use the -files option on the command line to specify files. The documentation states:

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.

In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.

-files hdfs://host:fs_port/user/testfile.txt

Therefore, I run:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

From my understanding of the API, this creates symlinks so I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r')

However, this still doesn't work. Any help anyone can offer would be hugely appreciated as I can't do much until I solve this.
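For reference, this is a minimal sketch of the mapper I have in mind under that scheme; it assumes the symlinks really are created under the names given after the # in the -files arguments (positive_words and negative_words, with no .txt extension), and the output format is only illustrative:

#!/usr/bin/env python
# Sketch of a streaming mapper that loads word lists from the symlinks
# created by -files in the task's current working directory.
import sys

def load_words(path):
    with open(path) as f:
        return {line.strip() for line in f}

# Symlink names as given after '#' in the -files arguments (no .txt)
positive = load_words('positive_words')
negative = load_words('negative_words')

for line in sys.stdin:
    words = line.strip().split()
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    print('%s\t%d\t%d' % (line.strip(), pos, neg))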

Edit 3:

I can use this option:

-file ~/Twitter/SentimentWordLists/positive_words.txt

along with the rest of my command to run the Hadoop job. This finds the file on my local system rather than on HDFS. It doesn't throw any errors, so it is accepted somewhere as a file. However, I've no idea how to access the file.

Solution after plenty of comments :)

To read a data file in Python: ship it with -file and add the following to your script:

import sys

Sometimes it is also necessary to add, after the import:

sys.path.append('.')

(related to @DrDee's comment in Hadoop Streaming - Unable to find file error)
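Putting the pieces together, a minimal mapper along these lines would ship the word list with -file and open it by its basename in the task's working directory (the output format here is only illustrative):

#!/usr/bin/env python
# Minimal sketch: the word list was shipped with
#   -file ~/Twitter/SentimentWordLists/positive_words.txt
# so it sits in the task's current working directory under its basename.
import sys

sys.path.append('.')  # sometimes needed so the task resolves files shipped with -file

with open('positive_words.txt') as f:
    positive = {line.strip() for line in f}

for line in sys.stdin:
    count = sum(1 for w in line.strip().split() if w in positive)
    print('%s\t%d' % (line.strip(), count))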

When dealing with HDFS programmatically, you should look into FileSystem, FileStatus, and Path. These are Hadoop API classes that allow you to access HDFS from within your program.
