
MRJob determining if running inline, local, emr or hadoop

I am building on some code from a few years back that processes the Common Crawl dataset on EMR using MRJob. Inside the mapper of its MRJob subclass, the code uses the following to determine whether it is running locally or on EMR:

self.options.runner == 'emr'

This seems to either have never worked or no longer works: self.options.runner is not passed through to the tasks, so it is always set to its default of 'inline'. Is there a way to determine whether the code is running locally or on EMR with the current version of MRJob (v0.5.0)?

I have found one solution, but I am still searching for a built-in one if anyone knows of it. You can add a custom passthrough option that gets passed to your tasks, which would look like so:

from mrjob.job import MRJob


class CCJob(MRJob):

    def configure_options(self):
        super(CCJob, self).configure_options()
        # Custom option that mrjob forwards to every task.
        self.add_passthrough_option(
            '--platform', default='local', choices=['local', 'remote'],
            help="indicate running remotely")

    def mapper(self, _, line):
        if self.options.platform == 'remote':
            pass  # remote-only behaviour goes here

You must then pass --platform remote on the command line whenever you run remotely; otherwise the tasks see the default value of 'local'.
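Because it is easy to forget the extra flag, one option is to build the command line in a small launcher so that --platform always matches the runner. This is only a sketch under assumptions: the helper name build_job_args and the script and input names in the usage note are hypothetical, and it relies on the --platform option defined above.

```python
import sys

# Runners that execute tasks on a cluster rather than on this machine.
REMOTE_RUNNERS = ("emr", "hadoop")


def build_job_args(script, runner, input_path):
    """Return the argv list for launching an MRJob script.

    The custom --platform flag is derived from the runner, so the two
    cannot get out of sync.
    """
    platform = "remote" if runner in REMOTE_RUNNERS else "local"
    return [sys.executable, script, "-r", runner,
            "--platform", platform, input_path]
```

For example, build_job_args("my_cc_job.py", "emr", "input.txt") yields a list that can be handed to subprocess.call, with --platform remote filled in automatically.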

Thank you to @pykler and @sebastian-nagel for posting about this, as trying to get the Common Crawl Python example working on Amazon EMR has been a headache.

In response to the solution @pykler posted, I believe there's a more idiomatic way that's shown in this PDF:

class CCJob(MRJob):
  def configure_options(self):
    super(CCJob, self).configure_options()
    self.pass_through_option('--runner')
    self.pass_through_option('-r')

The rest of the code, i.e. the if self.options.runner in ['emr', 'hadoop'] check, can then be left as is, and it should work on EMR by just passing the -r emr option as normal.
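If the same check is needed in several places, it can be factored into a small helper. The following is a minimal sketch; the names CLUSTER_RUNNERS and runs_on_cluster are hypothetical and not part of mrjob.

```python
# Runner names treated as "remote"; these match the check quoted above.
CLUSTER_RUNNERS = frozenset(["emr", "hadoop"])


def runs_on_cluster(runner):
    """Return True when the runner name refers to EMR or Hadoop."""
    return runner in CLUSTER_RUNNERS
```

Inside the mapper this would be called as runs_on_cluster(self.options.runner), assuming the runner option has been passed through to the tasks as described above.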

Also, there seems to be an issue when running a script on EMR that imports the mrcc module: I got an ImportError saying the module could not be found.

To get around this, create a new file containing the code you want to run, with the from mrcc import CCJob line replaced by the actual contents of mrcc.py. This is shown in this fork of the cc-mrjob repo.
