
Mapper file in HDFS cannot be found by streaming jar

I am currently trying to get a local version of Hadoop running, but I am a bit stuck. I used the following tutorial for my setup:

http://glebche.appspot.com/static/hadoop-ecosystem/hadoop-hive-tutorial.html

Now, I would like to run a simple MapReduce job using this tutorial:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

However, I seem to have some issues with HDFS, because when I run the following command:

:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py

I get the error that the mapper file cannot be found:

java.io.IOException: Cannot run program "code/mapper.py": error=2, No such file or directory

However, the file does exist in HDFS:

:tmp me$ hadoop dfs -ls code
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/11/20 21:28:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 lotte supergroup        536 2014-11-20 20:04 code/mapper.py
-rw-r--r--   1 lotte supergroup       1026 2014-11-20 20:04 code/reducer.py

What am I doing wrong?

Best Lotte
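
(For reference, the mapper and reducer from the linked tutorial are simple word-count scripts. The question does not show the contents of code/mapper.py, so this is only a sketch of what a mapper in that style looks like:)

    #!/usr/bin/env python
    # Minimal word-count mapper in the style of the linked tutorial:
    # read lines from stdin, emit one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))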

If you are submitting a streaming job to a Hadoop cluster, you have to specify the location of the mapper and reducer on the local filesystem using the -file command-line parameter, so that Hadoop can copy the files to all mapper and reducer tasks and they have access to the Python scripts. So, try something like this:

hadoop jar /path/to/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input /path/to/input \
  -output /path/to/output \
  -mapper /path/to/mapper.py \
  -reducer /path/to/reducer.py \
  -file /path/to/mapper.py \
  -file /path/to/reducer.py

Make sure to replace the paths in all the parameters. The -input and -output parameters are HDFS paths, while the remaining paths are on the local filesystem from which you launch the job.
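
One more thing worth double-checking (an assumption on my part, since the question does not show the script contents): streaming runs the scripts as standalone programs, so each one needs a shebang on its first line and the executable bit set, otherwise the same error=2 can appear even when the files are shipped correctly:

    head -1 mapper.py        # should print: #!/usr/bin/env python
    chmod +x mapper.py reducer.py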

Hadoop's streaming tool also supports shipping files that already live in HDFS. Here's an example:

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -files hdfs://host:port/user/<hadoop_username>/code/mapper.py,hdfs://host:port/user/<hadoop_username>/code/reducer.py \
    -Dmapred.reduce.tasks=1 \
    -input text/* \
    -output text/output \
    -mapper mapper.py \
    -reducer reducer.py

Note that shipping the files is still necessary. I use -files here because -file is deprecated. Also note that -mapper and -reducer refer to the symlinks (mapper.py, reducer.py) that -files creates in each task's working directory, not to the HDFS paths.
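
Once the job finishes, you can sanity-check the result from HDFS. With a single reducer, the streaming job writes one part file (the path below assumes the -output value used above):

    hadoop fs -cat text/output/part-00000 | head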

You are running

:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py

As Ashrith says, you have to use -files to specify the path to the mapper and the reducer, but they don't need to be local files. That is to say, if you have a mapper called basic_mapper.py and you store it in HDFS using the -put option, you can ship it straight from HDFS.

For example:

hadoop fs -put /home/<user>/files/basic_mapper.py hadoop/mappers

Now your mapper is in HDFS, so you can invoke it from its new location:

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -files hdfs://host:port/user/<hadoop_username>/hadoop/mappers/basic_mapper.py -input text/* -output text/output -mapper basic_mapper.py

Be careful, because -files creates a symlink called basic_mapper.py in the task's working directory, not hadoop/mappers/basic_mapper.py, so -mapper must reference just the file name.
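
If you want the symlink to have a different name, you can append a fragment to the URI (standard distributed-cache behavior; the wc_mapper.py name here is just an illustration):

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -Dmapred.reduce.tasks=1 \
    -files hdfs://host:port/user/<hadoop_username>/hadoop/mappers/basic_mapper.py#wc_mapper.py \
    -input text/* \
    -output text/output \
    -mapper wc_mapper.py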
