
MRJob determining if running inline, local, emr or hadoop

I am building on some old code from a few years back that uses the Common Crawl dataset on EMR with MRJob. The code uses the following check inside an MRJob subclass's mapper function to determine whether it is running locally or on EMR:

self.options.runner == 'emr'

This seems either to have never worked or to no longer work: self.options.runner is not passed through to the tasks and is therefore always set to its default of 'inline'. The question is: is there a way to determine whether the code is running locally or on EMR with the current version of MRJob (v0.5.0)?
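For context, the pattern in the old code looked roughly like this (a reconstruction based on the check above, not the original source; the branch bodies are placeholders):

from mrjob.job import MRJob

class CCJob(MRJob):

    def mapper(self, _, line):
        # Intended behaviour: branch on where the job is running.
        # In practice self.options.runner is always 'inline' here.
        if self.options.runner == 'emr':
            pass  # e.g. fetch the input file from S3
        else:
            pass  # e.g. read the input file from the local filesystem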

I have found one solution, but I am still looking for a built-in way if anyone knows of one. You can add a custom passthrough option that gets passed to your tasks, which looks like this:

from mrjob.job import MRJob


class CCJob(MRJob):

    def configure_options(self):
        super(CCJob, self).configure_options()
        # Passthrough options are forwarded to the task command lines,
        # so their values are visible inside mappers and reducers.
        self.add_passthrough_option(
            '--platform', default='local', choices=['local', 'remote'],
            help="indicate running remotely")

    def mapper(self, _, line):
        if self.options.platform == 'remote':
            # remote-only behaviour goes here
            pass

You must then pass --platform remote when running remotely; a sample invocation is sketched below.
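A minimal sketch of how the job might be launched, assuming the class above lives in a file named ccjob.py (the file name, bucket and input paths are placeholders):

# Entry point at the bottom of ccjob.py
if __name__ == '__main__':
    CCJob.run()

# Example invocations:
#   local test run (default --platform local):
#     python ccjob.py input/test.warc
#   on EMR, with the custom flag set:
#     python ccjob.py -r emr --platform remote s3://my-bucket/input/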

Thank you to @pykler and @sebastian-nagel for posting about this, as trying to get the Common Crawl Python example working on Amazon EMR has been a headache.

In response to the solution @pykler posted, I believe there is a more idiomatic way, shown in this PDF:

from mrjob.job import MRJob

class CCJob(MRJob):
    def configure_options(self):
        super(CCJob, self).configure_options()
        # Pass the built-in runner option through to the tasks.
        self.pass_through_option('--runner')
        self.pass_through_option('-r')

Then the rest of the code, i.e. the if self.options.runner in ['emr', 'hadoop'] check, can be left as is, and it should work on EMR simply by passing the -r emr option as normal.
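Putting the pieces together, a minimal sketch of the combined pattern might look like this (the branch bodies and the entry point are illustrative; only configure_options and the passthrough calls come from the PDF):

from mrjob.job import MRJob

class CCJob(MRJob):
    def configure_options(self):
        super(CCJob, self).configure_options()
        self.pass_through_option('--runner')
        self.pass_through_option('-r')

    def mapper(self, _, line):
        # With --runner passed through, this now reflects the real
        # runner name inside the task instead of the 'inline' default.
        if self.options.runner in ['emr', 'hadoop']:
            pass  # remote behaviour, e.g. fetch input from S3
        else:
            pass  # local behaviour

if __name__ == '__main__':
    CCJob.run()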

Also, there seems to be an issue when running a script on EMR that imports the mrcc module. I got an ImportError saying the module could not be found.

To get around this, you should create a new file containing the code you want to run, with the from mrcc import CCJob line replaced by the actual contents of mrcc.py. This is shown in this fork of the cc-mrjob repo.
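A rough sketch of that single-file layout, assuming your job subclasses CCJob as in the cc-mrjob examples (class names here are illustrative):

from mrjob.job import MRJob

class CCJob(MRJob):
    # The full body of CCJob from mrcc.py is pasted here
    # instead of being imported.
    pass

class MyAnalysisJob(CCJob):
    # Your job logic stays exactly as before; only the
    # "from mrcc import CCJob" line at the top is gone.
    pass

if __name__ == '__main__':
    MyAnalysisJob.run()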
