Running MRJob from an IPython notebook

Question

I'm trying to run the mrjob example from an IPython notebook:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

then run it with this code:

mr_job = MRWordFrequencyCount(args=["testfile.txt"])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

and I get this error:

TypeError: <module '__main__' (built-in)> is a built-in class

Is there a way to run mrjob from an IPython notebook?

Answer 1

I suspect it is due to this limitation stated on the MRJob website:

The file with the job class is sent to Hadoop to be run. Therefore, the job file cannot attempt to start the Hadoop job, or you would be recursively creating Hadoop jobs! The code that runs the job should only run outside of the Hadoop context.

Alternatively, it might be because you didn't have the following (reference):

if __name__ == '__main__':  
  MRWordCounter.run()  # where MRWordCounter is your job class
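
To illustrate what that guard does, here is a minimal sketch that uses only the standard library. It does not involve mrjob: the run() function is a hypothetical stand-in for MRWordCounter.run(), and job_file.py and job_module are arbitrary names chosen for the example. The point it shows is that the guarded call fires only when the file is executed as a script, not when the same file is loaded under another module name (for example by an import in a notebook cell).

```python
import os
import runpy
import tempfile

# Source of a tiny stand-in "job file". run() is a hypothetical placeholder
# for MRWordCounter.run(); no mrjob code is involved here.
JOB_SOURCE = '''
launched = False

def run():
    global launched
    launched = True

if __name__ == "__main__":
    run()
'''


def launched_when_loaded_as(run_name):
    """Execute the stand-in job file with __name__ set to run_name and
    report whether the guarded run() call fired."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job_file.py")
        with open(path, "w") as handle:
            handle.write(JOB_SOURCE)
        module_globals = runpy.run_path(path, run_name=run_name)
    return module_globals["launched"]


as_script = launched_when_loaded_as("__main__")    # like `python job_file.py`
as_import = launched_when_loaded_as("job_module")  # like `import job_file`
print(as_script, as_import)
```

Here as_script comes out true and as_import false, so leaving the guard out of a job file means nothing happens when the file is run as a script.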

Answer 2

I haven't found the "perfect way" yet, but one thing you can do is create a notebook cell that uses the %%file magic to write the cell contents to a file:

%%file wordcount.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)
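
Why putting the class in a file helps can be sketched with the standard library alone. The error in the question has the wording that inspect.getfile produces for a class whose module has no __file__ attribute, which is the situation for anything defined directly in a notebook cell. The sketch below assumes (the thread does not confirm this) that mrjob finds the job script through the inspect module; notebook_like and Job are hypothetical names, and no mrjob code is involved.

```python
import inspect
import json
import sys
import types

# Build a module object that, like a notebook's __main__, has no __file__.
fake_module = types.ModuleType("notebook_like")
sys.modules["notebook_like"] = fake_module

# Define a class inside it, so that Job.__module__ is "notebook_like".
exec("class Job(object):\n    pass\n", fake_module.__dict__)
Job = fake_module.Job


def source_file_or_error(cls):
    """Return the source file of cls, or the exception that inspect raised."""
    try:
        return inspect.getsourcefile(cls)
    except (TypeError, OSError) as exc:
        return exc


# A class with no backing file: inspect has nothing to point at.
in_memory = source_file_or_error(Job)
print(type(in_memory).__name__, in_memory)

# A class that lives in a file on disk resolves to that file.
on_disk = source_file_or_error(json.JSONDecoder)
print(on_disk)
```

The first call should report a TypeError whose message ends in "is a built-in class", while the second returns a path to a .py file. Once the job class lives in wordcount.py, it falls into the second case.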

And then have mrjob run that file in a later cell:

import wordcount
reload(wordcount)

mr_job = wordcount.MRWordFrequencyCount(args=['example.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print key, value

Notice that I called my file wordcount.py and that I import the class MRWordFrequencyCount from the wordcount module: the filename and the module name have to match. Also, Python caches imported modules, so when you change wordcount.py, IPython will not reload the module but will keep using the old, cached one. That's why I put the reload() call in there.
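
The caching behaviour is easy to reproduce without mrjob. The following is a sketch under a few assumptions: it targets Python 3, where the reload() builtin is no longer available and importlib.reload does the same job, and cached_demo and VALUE are hypothetical names standing in for wordcount and its contents.

```python
import importlib
import os
import sys
import tempfile

# Work in a scratch directory so the demo module stays out of the way.
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
module_path = os.path.join(workdir, "cached_demo.py")


def write_module(value):
    """(Re)write cached_demo.py so that it defines VALUE = value."""
    with open(module_path, "w") as handle:
        handle.write("VALUE = %d\n" % value)
    # Make sure the import system notices the new or changed file.
    importlib.invalidate_caches()


write_module(1)
import cached_demo
first = cached_demo.VALUE

write_module(22)
import cached_demo  # a repeated import just returns the cached module
stale = cached_demo.VALUE

importlib.reload(cached_demo)  # re-executes the file that is now on disk
fresh = cached_demo.VALUE

print(first, stale, fresh)
```

The second import still sees the old value; only the explicit reload picks up the edited file. The same applies to a job file rewritten by %%file between runs.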

Reference: https://groups.google.com/d/msg/mrjob/CfdAgcEaC-I/8XfJPXCjTvQJ

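Update (shorter)
For a shorter second notebook cell, you can run the job by invoking the shell from within the notebook. This assumes the job file ends with the if __name__ == '__main__': block that calls the job class's run() method, and that the script is not itself named mrjob.py, since a file with that name would shadow the mrjob package on import:

! python wordcount.py shakespeare.txt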

Reference: http://jupyter.cs.brynmawr.edu/hub/dblank/public/Jupyter%20Magics.ipynb
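
The ! prefix only works inside IPython. Outside a notebook, roughly the same thing can be done with subprocess; the sketch below is one possible way, not an mrjob feature. run_job_script is a hypothetical helper, and the stand-in script in the usage lines only echoes its argument so that the example runs without mrjob or Hadoop installed.

```python
import os
import subprocess
import sys
import tempfile


def run_job_script(script_path, input_path):
    """Run `python script_path input_path` in a child process and
    return (exit code, stdout, stderr), with the streams as text."""
    completed = subprocess.run(
        [sys.executable, script_path, input_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Usage with a stand-in script that just echoes its first argument.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "echo_job.py")
    with open(script, "w") as handle:
        handle.write("import sys\nprint('input:', sys.argv[1])\n")
    code, out, err = run_job_script(script, "shakespeare.txt")

print(code, out.strip())
```

With a real job file, script_path would be the file written by %%file, and a non-zero exit code together with the captured stderr is the first place to look when the job fails.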

Answer 1: 2015-07-22 14:23:51
Answer 2: 2015-10-27 00:26:22
