
Change Mapreduce intermediate output location using MRJob

I am trying to run a Python script using MRJob on a cluster where I don't have admin permissions, and I get the error pasted below. What I think is happening is that the job is trying to write its intermediate files to the default /tmp/... directory, and since that is a protected location I don't have permission to write to, the job fails with an error and exits. I would like to change this temporary output location to somewhere in my local filesystem, e.g. /home/myusername/some_path_in_my_local_filesystem_on_the_cluster. In short: what additional parameters would I have to pass to move the intermediate output from /tmp/... to a local path where I have write permission?

I invoke my script as:

python myscript.py  input.txt -r hadoop > output.txt

The error:

no configs found; falling back on auto-configuration
    creating tmp directory /tmp/13435.1.all.q/mr_word_freq_count.myusername.20131215.004905.274232
    writing wrapper script to /tmp/13435.1.all.q/mr_word_freq_count.myusername.20131215.004905.274232/setup-wrapper.sh
    STDERR: mkdir: org.apache.hadoop.security.AccessControlException: Permission denied: user=myusername, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
    Traceback (most recent call last):
      File "/home/myusername/privatemodules/python/examples/mr_word_freq_count.py", line 37, in <module>
        MRWordFreqCount.run()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/job.py", line 500, in run
        mr_job.execute()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/job.py", line 518, in execute
        super(MRJob, self).execute()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/launch.py", line 146, in execute
        self.run_job()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/launch.py", line 207, in run_job
        runner.run()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 236, in _run
        self._upload_local_files_to_hdfs()
      File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 263, in _upload_local_files_to_hdfs
        self._mkdir_on_hdfs(self._upload_mgr.prefix)

Are you running mrjob as a "local" job, or trying to run it on your Hadoop cluster?

If you are actually trying to run it on Hadoop, you can control the "scratch" HDFS location (where mrjob stores intermediate files) using the --base-tmp-dir flag:

python mr.py -r hadoop -o hdfs:///user/you/output_dir --base-tmp-dir hdfs:///user/you/tmp  hdfs:///user/you/data.txt
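If you prefer not to pass the flag on every invocation, mrjob also reads options from a config file (~/.mrjob.conf). A minimal sketch, assuming the mrjob version from the traceback above (0.4.x-era option names; later releases split this option into local_tmp_dir and hadoop_tmp_dir) and a hypothetical writable path:

```yaml
# ~/.mrjob.conf -- sketch only; option name matches the
# --base-tmp-dir flag shown above, and the path is an example
# of a directory where you have write permission
runners:
  hadoop:
    base_tmp_dir: /home/myusername/tmp
```

With this in place, `python myscript.py input.txt -r hadoop > output.txt` should pick up the temporary directory automatically; check your installed mrjob version's docs for the exact option name.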
