
Take the output of a Hive "select" as the input of a "hadoop jar" streaming job

I am experimenting with a machine learning package called vowpal wabbit. To run vowpal wabbit on our Hadoop cluster, the recommended command is:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
    -Dmapred.job.name="vw allreduce $in_directory" \
    -Dmapred.map.tasks.speculative.execution=true \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.child.java.opts="-Xmx100m" \
    -Dmapred.task.timeout=600000000 \
    -Dmapred.job.map.memory.mb=1000 \
    -input <in_directory> \
    -output <out_directory> \
    -file /home/produser/vowpal_wabbit/vowpalwabbit/vw \
    -file /usr/lib64/libboost_program_options.so.5 \
    -file /lib64/libz.so.1 \
    -file /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -mapper /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -reducer NONE

where runvw-yarn.sh, as the mapper, calls the vowpal wabbit command on each machine against the piece of the data stored there.
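For context, a Hadoop streaming mapper is simply a program that reads its input split from stdin. A heavily simplified sketch of such a wrapper (my own simplification for illustration, not the actual runvw-yarn.sh, which also wires up vowpal wabbit's AllReduce options) might look like:

    #!/usr/bin/env bash
    # Simplified sketch of a streaming mapper script; the real
    # runvw-yarn.sh also configures vowpal wabbit's AllReduce setup.
    set -e
    # Hadoop streaming pipes this map task's input split to stdin,
    # and vw trains directly on it (vw reads stdin by default).
    ./vw --cache_file temp.cache -f model.out
    # A real script would then copy the model to HDFS, e.g.:
    # hadoop fs -put model.out <out_directory>/model.$mapred_task_id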

I have to reformat the data before I pass it in. I tried to use a Hive query to select the data from the grid and reformat it, and then pass it to the "hadoop jar" command. But I don't want to store the reformatted data on our cluster and waste the space. So I don't know what to put after the "-input" option of the "hadoop jar" command.

So my question is: is there a way to put something like "stdin" after the "-input" option? And where should I put that "hadoop jar" command in my Hive query after I select the data?

PS: I found "hive --service jar", and it looks similar to "hadoop jar". Is it helpful here?

Thank you! I just started to learn Hadoop and Hive a couple of weeks ago, so if you have a better design or solution, feel free to let me know. I can rewrite everything.

It seems that you are going to run two rounds of MapReduce: the first is the Hive query and the second is the MapReduce streaming job. As far as I know, to chain multi-round MapReduce jobs, you always need to write to and read from HDFS between the rounds. That is why MapReduce is called a batch operation.

So the answer to your question is no.
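What you can do is stage the reformatted rows in a temporary HDFS directory and delete it as soon as the streaming job finishes, so the extra space is only consumed while the job runs. A rough sketch (the staging path, query, and table name are placeholders for your own):

    #!/usr/bin/env bash
    # Sketch: materialize the Hive SELECT into a temporary HDFS
    # directory, run the streaming job on it, then clean up.
    STAGING=/tmp/vw_input_$$

    # 1. Have Hive write the reformatted rows into the staging directory.
    hive -e "INSERT OVERWRITE DIRECTORY '$STAGING'
             SELECT <your reformatting expressions> FROM <your_table>"

    # 2. Point the streaming job at the staged data; keep the remaining
    #    -D and -file options exactly as in the question.
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
        -input "$STAGING" \
        -output <out_directory> \
        -mapper /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
        -reducer NONE

    # 3. Reclaim the space once vw is done.
    hadoop fs -rm -r "$STAGING"

One caveat: INSERT OVERWRITE DIRECTORY writes Hive's default ^A-delimited text files, so make sure the SELECT produces exactly the line format vw expects, for example by concatenating each record into a single string column.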
