Bootstrapping libraries on EMR using python MRJob

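Problem statement:

I am trying to run a map-reduce job on Amazon EMR using the python MRJob library, and I am having trouble bootstrapping the nodes with the required libraries and packages.

Details:

My sample python mrjob code: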

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    .. do something ..

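Some libraries, such as NLTK, need to be imported, along with some local modules of my own, e.g. `from sentClassifier import sentClassify`.

I would like to know the best way to bootstrap the EMR nodes so that these modules and packages are available. The code runs fine on my local machine.

My sample mrjob.conf file: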

    runners:
      emr:
        aws_access_key_id: ***
        aws_secret_access_key: ***
        ec2_core_instance_type: m1.large
        ec2_key_pair: mykey
        ec2_key_pair_file: mykey.pem
        num_ec2_core_instances: 5
        pool_wait_minutes: 2
        pool_emr_job_flows: true
        ssh_tunnel_is_open: true
        ssh_tunnel_to_job_tracker: true
      hadoop:
        setup:
          - virtualenv venv
          - . venv/bin/activate
          - pip install mr3po simplejson
          - sudo easy_install https://code.google.com/p/nltk/downloads/detail?name=nltk-2.0b9-py2.6.egg&can=2&q=

But the job fails.

I have read through several references and tried the various approaches they describe, but still no luck.

Error log:

    Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk
    (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
    Attempting to terminate job...
    Job appears to have already been terminated
    Killing our SSH tunnel (pid 12909)
    Traceback (most recent call last):
      File "obidroidMR.py", line 107, in <module>
        ObidroidReview.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
        mr_job.execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
        super(MRJob, self).execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
        self.run_job()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
        runner.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
        self._wait_for_job_to_complete()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
        raise Exception(msg)
    Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk

Any help would be much appreciated.

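The lines that install the packages are probably not where they should be in mrjob.conf. Anything that applies to jobs running on EMR should be listed under emr:, not hadoop: (the latter is the configuration used when a job runs on a local Hadoop installation).

If the install is a simple Linux command, such as pip or apt-get, you should be able to install the package like this: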

    runners:
      emr:
        aws_access_key_id: ***
        ... all the other stuff ...
        bootstrap_cmds:
        - sudo apt-get install -y python-boto
        - sudo pip install simplejson

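I have never tried installing NLTK specifically, so I can't help you there, but you should be able to install it this way.

For a potentially more complicated install, I would suggest ssh-ing into the master node with the EMR CLI: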

    $ ./elastic-mapreduce -j JOB_FLOW_ID --ssh

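Then try installing the package there. Once you find a series of shell commands that installs the package successfully, you can simply copy and paste them into your mrjob.conf.

Given that Amazon Elastic MapReduce uses AMIs based on Amazon Linux, I have verified that nltk can be installed on Amazon Linux AMI 2014.03.1 - ami-fb8e9292 (64-bit) with: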

    sudo easy_install -U pip
    sudo easy_install -U distribute
    sudo pip install -U pyyaml nltk

You can try incorporating these 3 lines into your mrjob.conf.
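As a rough sketch of what that merge might look like, assuming the `bootstrap_cmds` option shown earlier is supported by your mrjob version and that the AMI your cluster uses behaves like the one tested above (neither is confirmed for your setup), the three commands would go under `emr:` rather than `hadoop:`:

```yaml
runners:
  emr:
    # ... credentials and instance settings from the question go here ...
    bootstrap_cmds:
      # Upgrade the packaging tools first, then install PyYAML and NLTK.
      # These are the three commands from above, unchanged.
      - sudo easy_install -U pip
      - sudo easy_install -U distribute
      - sudo pip install -U pyyaml nltk
```

This only covers packages installable from the command line; it does not address shipping local modules such as sentClassifier. Option names for bootstrapping have changed between mrjob releases, so it may be worth checking the documentation for the version you have installed.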
