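Running MapReduce from Jupyter Notebook
I am trying to run MapReduce from a Jupyter Notebook on the dataset in the u.data file, but I keep getting this error message:
"TypeError: 'str' object doesn't support item deletion".
How can I get the code to run successfully?
u.data contains information like the following: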
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
Here is the code:
from mrjob.job import MRJob

class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurences):
        yield rating, sum(occurences)

if __name__ == "main__":
    MRRatingCounter.run()

filepath = "u.data"
MRRatingCounter(filepath)
If this code is saved in a .py file and run from the command line, it runs successfully: !python ratingCounter.py u.data
MRRatingCounter needs to live in its own .py file, for example MRRatingCounter.py:
from mrjob.job import MRJob

class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurences):
        yield rating, sum(occurences)

if __name__ == "__main__":
    MRRatingCounter.run()
Then import the class into the notebook and execute it through a runner:
from MRRatingCounter import MRRatingCounter

# args must be a list of command-line style arguments
mr_job = MRRatingCounter(args=['u.data'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        # handle each line however you like
        print(line)
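On recent mrjob versions the loop above may need adjusting. As far as I know, runner.stream_output() was deprecated in mrjob 0.6 in favor of runner.cat_output() combined with the job's parse_output(). Here is a minimal sketch under that assumption; run_job is a hypothetical helper name, not part of mrjob:

```python
def run_job(job_class, input_path):
    """Run an MRJob subclass on input_path and return its (key, value) pairs.

    Assumption: mrjob >= 0.6, where runner.cat_output() and
    job.parse_output() are available. run_job itself is a hypothetical
    helper, not an mrjob function.
    """
    # args must be a list of arguments, not a bare string
    job = job_class(args=[input_path])
    with job.make_runner() as runner:
        runner.run()
        # Read the output while the runner is still open: cat_output()
        # yields raw bytes and parse_output() decodes them into pairs.
        return list(job.parse_output(runner.cat_output()))
```

In the notebook this would be used as: from MRRatingCounter import MRRatingCounter, then loop with "for rating, count in run_job(MRRatingCounter, 'u.data'): print(rating, count)".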
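As you mentioned, the important part is saving the code to a .py file. To do that from the notebook, start the cell with %%file filename.py.
In this case I used rc.py as the file name, and all of the code goes into one cell: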
%%file rc.py
from mrjob.job import MRJob

class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userId, movieId, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, rating, occurances):
        yield rating, sum(occurances)

if __name__ == '__main__':
    MRRatingCounter.run()
Once that cell has been run, you can run the following command in the next cell:
!python rc.py u.data
This will give you the desired output.
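To sanity-check what the job should report, the same map and reduce steps can be sketched in plain Python without mrjob. The sketch uses only the ten sample rows shown in the question and assumes they are tab-separated, as the mapper's split("\t") implies. The names SAMPLE_LINES and count_ratings are made up for illustration.

```python
from collections import Counter

# The ten sample rows from the question, assumed to be tab-separated.
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
    "298\t474\t4\t884182806",
    "115\t265\t2\t881171488",
    "253\t465\t5\t891628467",
    "305\t451\t3\t886324817",
    "6\t86\t3\t883603013",
]

def count_ratings(lines):
    """Mimic the job: emit (rating, 1) per line, then sum the ones per rating."""
    counts = Counter()
    for line in lines:
        user_id, movie_id, rating, timestamp = line.split("\t")
        counts[rating] += 1  # the "reduce" step: sum occurrences per rating
    return dict(counts)

if __name__ == "__main__":
    for rating, count in sorted(count_ratings(SAMPLE_LINES).items()):
        print(rating, count)
```

For these ten rows the counts are 2 for rating 1, 2 for rating 2, 4 for rating 3, 1 for rating 4, and 1 for rating 5. The full u.data file will give larger numbers.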