File not being cached on AWS Elastic MapReduce

Question

I am running the following MapReduce job on AWS Elastic MapReduce:

./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE \
    --mapper s3://classify.mysite.com/mapper.py \
    --reducer s3://classify.mysite.com/reducer.py \
    --input s3n://classify.mysite.com/s3_list.txt \
    --output s3://classify.mysite.com/dat_output4/ \
    --cache s3n://classify.mysite.com/classifier.py#classifier.py \
    --cache-archive s3n://classify.mysite.com/policies.tar.gz#policies \
    --bootstrap-action s3://classify.mysite.com/bootstrap.sh \
    --enable-debugging \
    --master-instance-type m1.large \
    --slave-instance-type m1.large \
    --instance-type m1.large

For some reason, the cacheFile classifier.py does not seem to be cached. When reducer.py tries to import it, I get this error:

  File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
    from classifier import text_from_html, train_classifiers
ImportError: No module named classifier

classifier.py is definitely present at s3n://classify.mysite.com/classifier.py. For what it's worth, the policies archive seems to load just fine.

Answer 1

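I don't know how to fix this on EC2, but I have seen it before when using Python in traditional Hadoop deployments. Hopefully the lesson carries over.

What we need to do is add the directory that reducer.py is in to the Python path, because presumably classifier.py is in there too. For whatever reason, that directory is not on the Python path, so Python cannot find classifier.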

import sys
import os.path

# Add the directory containing reducer.py to the Python path.
# __file__ is the path to reducer.py, including the file name;
# abspath makes it absolute and dirname strips the file name.
# sys.path is the list of directories Python searches for modules.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from classifier import text_from_html, train_classifiers

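Your code probably works locally because you run it from the directory the code lives in. Hadoop is probably not running it from the same place you are, in terms of the current working directory.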
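
To see which directories are actually involved on a task node, a small diagnostic can be placed at the top of reducer.py. The following is only a sketch: the function name describe_import_environment is hypothetical, and it assumes that whatever the script writes to stderr ends up in the task attempt logs, as is normal for Hadoop Streaming.

```python
import os
import sys


def describe_import_environment(script_path):
    """Return a dict describing where Python will look for modules.

    script_path is the path of the running script (normally __file__).
    The function name and the dict keys are illustrative only.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    cwd = os.getcwd()
    return {
        "cwd": cwd,
        "script_dir": script_dir,
        # "" and "." on sys.path both mean "the current working directory".
        "cwd_on_path": cwd in sys.path or "" in sys.path or "." in sys.path,
        "script_dir_on_path": script_dir in sys.path,
        # Shows whether classifier.py was actually placed in the work dir.
        "cwd_files": sorted(os.listdir(cwd)),
    }


if __name__ == "__main__":
    info = describe_import_environment(sys.argv[0])
    for key in sorted(info):
        # stderr keeps the diagnostics out of the reducer's real output.
        sys.stderr.write("%s: %r\n" % (key, info[key]))
```

If classifier.py shows up in cwd_files while cwd_on_path is False, the file was cached and the problem is the module search path, which is what the fix above addresses.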

Answer 2

Credit for this goes to his comment. I had to append the working directory to the system path:

sys.path.append('./')
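
A slightly more defensive variant of the same one-line fix, offered as a sketch rather than a required change: a relative entry such as './' is resolved against whatever the working directory is at import time, so converting it to an absolute path first keeps the search location stable even if the script later calls os.chdir. The helper name add_to_sys_path is hypothetical.

```python
import os
import sys


def add_to_sys_path(directory):
    """Put the absolute form of directory at the front of sys.path.

    Does nothing if that absolute path is already listed.
    Returns the absolute path that was checked or added.
    """
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# In reducer.py this call would come before "from classifier import ...".
add_to_sys_path(os.getcwd())
```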

Also, I recommend that anyone with a similar problem read this excellent post on using the distributed cache on AWS: https://forums.aws.amazon.com/message.jspa?messageID=152538

2 solutions

Solution 1
4 Accepted 2012-05-01 02:57:42

Solution 2
1 2012-05-01 19:27:41