
Mapper file in HDFS cannot be found by streaming jar

I am currently trying to get a local version of Hadoop running, but I am a bit stuck. I used the following tutorial for my setup:

http://glebche.appspot.com/static/hadoop-ecosystem/hadoop-hive-tutorial.html

Now, I would like to run a simple MapReduce job using this tutorial:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

However, I seem to have some issues with HDFS, because when I run the following command:

:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py

I get the error that the mapper file cannot be found:

java.io.IOException: Cannot run program "code/mapper.py": error=2, No such file or directory

However, the file does exist in HDFS:

:tmp me$ hadoop dfs -ls code
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/11/20 21:28:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 lotte supergroup        536 2014-11-20 20:04 code/mapper.py
-rw-r--r--   1 lotte supergroup       1026 2014-11-20 20:04 code/reducer.py

What am I doing wrong?

Best Lotte
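
(For reference, the mapper and reducer from the linked tutorial are simple word-count scripts. The question does not show the contents of code/mapper.py, so this is only a sketch of what a mapper in that style looks like:)

    #!/usr/bin/env python
    # Minimal word-count mapper in the style of the linked tutorial:
    # read lines from stdin, emit one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%s' % (word, 1))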

If you are submitting a streaming job to a Hadoop cluster, you have to specify the location of the mapper and reducer on the local filesystem using the -file command-line parameter, so that Hadoop can copy the files to all mapper and reducer tasks and they have access to the Python scripts. So, try something like this:

hadoop jar /path/to/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input /path/to/input \
  -output /path/to/output \
  -mapper /path/to/mapper.py \
  -reducer /path/to/reducer.py \
  -file /path/to/mapper.py \
  -file /path/to/reducer.py

Make sure to replace the paths in all the parameters. The -input and -output parameters are HDFS paths, while the remaining paths are on the local filesystem from which you launch the job.
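
One more thing worth double-checking (an assumption on my part, since the question does not show the script contents): streaming runs the scripts as standalone programs, so each one needs a shebang on its first line and the executable bit set, otherwise the same error=2 can appear even when the files are shipped correctly:

    head -1 mapper.py        # should print: #!/usr/bin/env python
    chmod +x mapper.py reducer.py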

Hadoop's streaming tool also supports shipping files that already live in HDFS. Here's an example:

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -files hdfs://host:port/user/<hadoop_username>/code/mapper.py,hdfs://host:port/user/<hadoop_username>/code/reducer.py \
    -Dmapred.reduce.tasks=1 \
    -input text/* \
    -output text/output \
    -mapper mapper.py \
    -reducer reducer.py

Note that shipping the files is still necessary. I use -files here because -file is deprecated. Also note that -mapper and -reducer refer to the symlinks (mapper.py, reducer.py) that -files creates in each task's working directory, not to the HDFS paths.
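
Once the job finishes, you can sanity-check the result from HDFS. With a single reducer, the streaming job writes one part file (the path below assumes the -output value used above):

    hadoop fs -cat text/output/part-00000 | head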

You are running

:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py

As Ashrith says, you have to use -files to specify the path to the mapper and the reducer, but they don't need to be local files. That is to say, if you have a mapper called basic_mapper.py and you store it in HDFS using the -put option, you can ship it straight from HDFS.

For example:

hadoop fs -put /home/<user>/files/basic_mapper.py hadoop/mappers

Now your mapper is in HDFS, so you can invoke it from its new location:

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -files hdfs://host:port/user/<hadoop_username>/hadoop/mappers/basic_mapper.py -input text/* -output text/output -mapper basic_mapper.py

Be careful, because -files creates a symlink called basic_mapper.py in the task's working directory, not hadoop/mappers/basic_mapper.py, so -mapper must reference just the file name.
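
If you want the symlink to have a different name, you can append a fragment to the URI (standard distributed-cache behavior; the wc_mapper.py name here is just an illustration):

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -Dmapred.reduce.tasks=1 \
    -files hdfs://host:port/user/<hadoop_username>/hadoop/mappers/basic_mapper.py#wc_mapper.py \
    -input text/* \
    -output text/output \
    -mapper wc_mapper.py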
