Use the output of a Hive “select” as the input of a hadoop jar command

Question

I am experimenting with a machine learning package called Vowpal Wabbit. To run Vowpal Wabbit on our Hadoop cluster, the recommended command is:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
    -Dmapred.job.name="vw allreduce $in_directory" \
    -Dmapred.map.tasks.speculative.execution=true \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.child.java.opts="-Xmx100m" \
    -Dmapred.task.timeout=600000000 \
    -Dmapred.job.map.memory.mb=1000 \
    -input <in_directory> \
    -output <out_directory> \
    -file /home/produser/vowpal_wabbit/vowpalwabbit/vw \
    -file /usr/lib64/libboost_program_options.so.5 \
    -file /lib64/libz.so.1 \
    -file /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -mapper /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -reducer NONE

where runvw-yarn.sh, as the mapper, runs the Vowpal Wabbit command on each machine against the piece of data stored on it.

I have to reformat the data before I pass it in. I tried to use a Hive query to select the data from the grid, reformat it, and then pass it to the "hadoop jar" command. But I don't want to store the reformatted data on our cluster, because that wastes space. So I don't know what to put after the "-input" option in the "hadoop jar" command.

So my question is: is there a way to put something like "stdin" after the "-input" option? Also, where should I put the "hadoop jar" command in my Hive query after I select the data?

PS: I found "hive --service jar", which looks similar to "hadoop jar". Is it helpful here?

Thank you! I just started learning Hadoop and Hive a couple of weeks ago, so if you have a better design or solution, feel free to let me know. I can rewrite everything.

Answer 1

It seems that you are going to run two rounds of MapReduce: the first is a Hive query and the second is a MapReduce streaming job. As far as I know, multi-round MapReduce jobs always need to write to and read from HDFS between rounds. That's why MapReduce is described as a batch operation.

So, the answer to your question is NO.
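
If the concern is only that the reformatted copy should not stay on the cluster, one possible workaround is to stage it in a temporary HDFS directory and delete it once the streaming job finishes. The sketch below is only an illustration under assumptions: the function name stage_and_run, the staging and output paths, and the query text are all hypothetical, and the "hadoop jar" options are trimmed to the ones that matter here (the full list is in the question). It assumes the query already emits each row as a single string in the format Vowpal Wabbit expects. By default it only prints the commands (DRY_RUN=1), so it can be inspected before anything touches the cluster.

```shell
#!/usr/bin/env bash
# Sketch: stage Hive output in a temporary HDFS directory, run the streaming
# job on it, then delete the staged copy. Names and paths passed in are
# hypothetical; adapt them to your cluster.
set -euo pipefail

STREAMING_JAR=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar
MAPPER=/home/produser/vowpal_wabbit/cluster/runvw-yarn.sh

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# stage_and_run STAGING_DIR OUT_DIR QUERY
stage_and_run() {
    local staging_dir="$1" out_dir="$2" query="$3"

    # Round 1: Hive writes the reformatted rows to the staging directory.
    run hive -e "INSERT OVERWRITE DIRECTORY '${staging_dir}' ${query}"

    # Round 2: the streaming job reads the staged rows (options trimmed).
    run hadoop jar "$STREAMING_JAR" \
        -Dmapred.reduce.tasks=0 \
        -input "$staging_dir" \
        -output "$out_dir" \
        -file "$MAPPER" \
        -mapper "$MAPPER" \
        -reducer NONE

    # Cleanup: remove the staged copy so it does not keep using space.
    # Note: with "set -e" this step is skipped if an earlier step fails.
    run hadoop fs -rm -r "$staging_dir"
}

# Run only when three arguments are supplied on the command line.
if [ "$#" -eq 3 ]; then
    stage_and_run "$@"
fi
```

Called as, for example, `DRY_RUN=0 ./stage_and_run.sh /tmp/vw_in /tmp/vw_out "SELECT ..."` (script name hypothetical), this still writes the data to HDFS between the two rounds, as described above; it only makes that copy short-lived.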

solution1
0 2013-08-06 13:30:17