
How to run a MRJob in a local Hadoop Cluster with Hadoop Streaming?

I am currently taking a Big Data class, and one of my projects is to run my Mapper/Reducer on a Hadoop cluster that is set up locally.

I have been using Python and the MRJob library for the class.

Here is my current Mapper/Reducer Python code:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import os

WORD_RE = re.compile(r"[\w']+")
choice = ""


class MRPrepositionsFinder(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            MRStep(reducer=self.reducer_find_prep_word)
        ]

    def mapper_get_words(self, _, line):
        # load the indicator words, lowercased and stripped of whitespace
        word_list = set(w.lower().strip() for w in open("/hdfs/user/user/indicators.txt"))

        # the file currently being processed, as reported by Hadoop Streaming
        file_name = os.environ['map_input_file']
        # iterate through each word in the line
        for word in WORD_RE.findall(line):
            # if the word is an indicator, yield the choice taken from the file path
            if word.lower() in word_list:
                choice = file_name.split('/')[5]
                yield (choice, 1)

    def reducer_find_prep_word(self, choice, counts):
        # each input is (choice, count), so sum the counts for every choice
        yield (choice, sum(counts))


if __name__ == '__main__':
    MRPrepositionsFinder.run()
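
For reference, here is a minimal sketch (not part of the assignment code) of what the mapper's two building blocks do; the sample line and path below are made-up values:

import re

WORD_RE = re.compile(r"[\w']+")
print(WORD_RE.findall("Please see the attached memo."))
# -> ['Please', 'see', 'the', 'attached', 'memo']

sample_path = "/hdfs/user/user/HRCmail/sample.txt"  # hypothetical input path
print(sample_path.split('/')[5])
# -> 'sample.txt', which is what the mapper yields as the choice key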

When I tried to run the code on the Hadoop cluster, I used the following command:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

Unfortunately, every time I run the command I get the following error:

No configs found; falling back on auto-configuration
STDERR: Error: JAVA_HOME is not set and could not be found.
Traceback (most recent call last):
  File "hrc_discover.py", line 37, in 
    MRPrepositionsFinder.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 432, in run
    mr_job.execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 453, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 161, in execute
    self.run_job()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 231, in run_job
    runner.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/runner.py", line 437, in run
    self._run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 346, in _run
    self._find_binaries_and_jars()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 361, in _find_binaries_and_jars
    self.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 198, in get_hadoop_version
    return self.fs.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 117, in get_hadoop_version
    stdout = self.invoke_hadoop(['version'], return_stdout=True)
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 172, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'version']' returned non-zero exit status 1

I looked around the internet and found that I need to export the JAVA_HOME variable, but I don't want to set anything that might break my setup.

Any help would be appreciated, thanks!

It looks like the problem was in the etc/hadoop/hadoop-env.sh script file.

The JAVA_HOME environment variable was configured as:

export JAVA_HOME=$(JAVA_HOME)

So I went ahead and changed it to the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
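
To double-check that the new value is picked up, the same command mrjob invokes internally (see the traceback above) can be run by hand; with JAVA_HOME resolving correctly it prints the Hadoop version banner instead of the earlier error:

/usr/bin/hadoop version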

I tried to run the following command again, hoping it would work:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

Luckily, MRJob picked up on the JAVA_HOME environment variable and produced the following output:

No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Looking for Hadoop streaming jar in /home/hadoop/contrib...
Looking for Hadoop streaming jar in /usr/lib/hadoop-mapreduce...
Hadoop streaming jar not found. Use --hadoop-streaming-jar
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...
.. 

To fix the issue with the Hadoop streaming jar, I added the following switch to the command:

--hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
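
If the streaming jar sits somewhere else on a given installation, it can be located with a quick search of the Hadoop install directory (the search root below matches the path above but may differ on other setups):

find /usr/lib/hadoop -name 'hadoop-streaming*.jar'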

The full command looked like this:

python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-streaming-jar /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output

This produced the following output:

No configs found; falling back on auto-configuration
Using Hadoop version 2.7.3
Creating temp directory /tmp/hrc_discover.user.20170306.022649.449218
Copying local files to hdfs:///user/user/tmp/mrjob/hrc_discover.user.20170306.022649.449218/files/...

It looks like the issue has been resolved and Hadoop should process my job.
