I am currently trying to get a local version of Hadoop running, but I am a bit stuck. I used the following tutorial for my setup:
http://glebche.appspot.com/static/hadoop-ecosystem/hadoop-hive-tutorial.html
Now, I would like to run a simple MapReduce job following this tutorial:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
However, I seem to have some issues with HDFS, because when I want to run the following command:
:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py
I get the error that the mapper file cannot be found:
java.io.IOException: Cannot run program "code/mapper.py": error=2, No such file or directory
However, the file does seem to exist:
:tmp me$ hadoop dfs -ls code
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
14/11/20 21:28:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 lotte supergroup 536 2014-11-20 20:04 code/mapper.py
-rw-r--r-- 1 lotte supergroup 1026 2014-11-20 20:04 code/reducer.py
What am I doing wrong?
Best, Lotte
If you are submitting a streaming job to a Hadoop cluster, you have to specify the location of the mapper and reducer on the local filesystem using the -file
command-line parameter, so that Hadoop copies the files to all mappers and reducers and they have access to the Python scripts. So, try something like this:
hadoop jar /path/to/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-input /path/to/input \
-output /path/to/output \
-mapper /path/to/mapper.py \
-reducer /path/to/reducer.py \
-file /path/to/mapper.py \
-file /path/to/reducer.py
Make sure to replace the paths in all the parameters. The -input
and -output
parameters are paths in HDFS, while the other paths are on the local filesystem from which you launch the job.
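Since streaming just feeds lines over stdin/stdout, the scripts can be sanity-checked locally (the equivalent of `cat input | mapper.py | sort | reducer.py`) before submitting the job. Here is a minimal word-count pair in the spirit of the tutorial — a sketch, not the asker's actual mapper.py/reducer.py:

```python
#!/usr/bin/env python
# Minimal word-count streaming logic (hypothetical; your actual
# mapper.py/reducer.py may differ). Hadoop streaming passes lines on
# stdin/stdout, so the same functions can be exercised locally.

def map_lines(lines):
    """Emit one 'word\t1' record per word, like the tutorial's mapper."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_records(records):
    """Sum counts per word; assumes records arrive sorted by key,
    which the Hadoop shuffle (or a local `sort`) guarantees."""
    current_word, current_count = None, 0
    for record in records:
        word, count = record.rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)

if __name__ == "__main__":
    # Local simulation of: cat input | mapper.py | sort | reducer.py
    mapped = sorted(map_lines(["foo bar foo", "bar baz"]))
    print("\n".join(reduce_records(mapped)))
```

If this pipeline produces the expected counts locally but the job still fails on the cluster, the problem is with how the scripts are shipped, not with their logic.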
Hadoop's streaming tool supports mapping files from HDFS. Here's an example:
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-files hdfs://host:port/user/<hadoop_username>/code/mapper.py,hdfs://host:port/user/<hadoop_username>/code/reducer.py \
-Dmapred.reduce.tasks=1 \
-input text/* \
-output text/output \
-mapper code/mapper.py \
-reducer code/reducer.py
Note that mapping the files is still necessary. I use -files
here because -file
is deprecated.
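Since -files takes a single comma-separated list of URIs, it is easy to build that argument programmatically when several scripts are involved. A small sketch, with a hypothetical host, port, and user:

```python
# Build the comma-separated value for -files from several HDFS URIs.
# The base path here is hypothetical; substitute your own namenode
# address and HDFS user directory.
base = "hdfs://host:port/user/me/code"
scripts = ["mapper.py", "reducer.py"]

files_arg = ",".join("%s/%s" % (base, s) for s in scripts)
print(files_arg)
```

The resulting string is passed as a single argument: `-files <files_arg>`.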
You are running:
:libexec me$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -input text/* -output text/output -mapper code/mapper.py -reducer code/reducer.py
As Ashrith says, you have to use -files
to specify the path to the mappers and reducers, but they don't need to be local files. That is to say, if you have a mapper called basic_mapper.py
and you store it in HDFS
using the -put
option, then you can use it from HDFS
.
For example: hadoop fs -put /home/<user>/files/basic_mapper.py hadoop/mappers
Now your mapper is in HDFS
, so you can invoke it from its new location:
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -Dmapred.reduce.tasks=1 -files hdfs://host:port/user/<hadoop_username>/hadoop/mappers/basic_mapper.py -input text/* -output text/output -mapper basic_mapper.py
Be careful, because it creates a symlink called basic_mapper.py
and not hadoop/mappers/basic_mapper.py
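In other words, Hadoop links the file into each task's working directory under the last component of the URI, which is the name -mapper must use. A quick sketch of that naming rule (the URI below is hypothetical):

```python
# The symlink created in the task working directory uses only the last
# path component of the -files URI, not the full HDFS path.
# (Streaming also accepts a '#name' suffix on the URI if you want a
# different link name.)
import posixpath

uri = "hdfs://host:port/user/me/hadoop/mappers/basic_mapper.py"  # hypothetical
symlink_name = posixpath.basename(uri)

# -mapper must reference this name:
print(symlink_name)
```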