
Take the output of a Hive "select" as the input of a "hadoop jar" streaming job

I am experimenting with a machine learning package called vowpal wabbit. To run vowpal wabbit on our Hadoop cluster, the recommended command is:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
    -Dmapred.job.name="vw allreduce $in_directory" \
    -Dmapred.map.tasks.speculative.execution=true \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.child.java.opts="-Xmx100m" \
    -Dmapred.task.timeout=600000000 \
    -Dmapred.job.map.memory.mb=1000 \
    -input <in_directory> \
    -output <out_directory> \
    -file /home/produser/vowpal_wabbit/vowpalwabbit/vw \
    -file /usr/lib64/libboost_program_options.so.5 \
    -file /lib64/libz.so.1 \
    -file /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -mapper /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
    -reducer NONE

where runvw-yarn.sh, as the mapper, calls the vowpal wabbit command on each machine against the piece of the data stored there.
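For context, a Hadoop streaming mapper is simply a program that reads its input split from stdin. A heavily simplified sketch of such a wrapper (my own simplification for illustration, not the actual runvw-yarn.sh, which also wires up vowpal wabbit's AllReduce options) might look like:

    #!/usr/bin/env bash
    # Simplified sketch of a streaming mapper script; the real
    # runvw-yarn.sh also configures vowpal wabbit's AllReduce setup.
    set -e
    # Hadoop streaming pipes this map task's input split to stdin,
    # and vw trains directly on it (vw reads stdin by default).
    ./vw --cache_file temp.cache -f model.out
    # A real script would then copy the model to HDFS, e.g.:
    # hadoop fs -put model.out <out_directory>/model.$mapred_task_id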

I have to reformat the data before I pass it in. I tried to use a Hive query to select the data from the grid and reformat it, and then pass it to the "hadoop jar" command. But I don't want to store the reformatted data on our cluster and waste the space. So I don't know what to put after the "-input" option of the "hadoop jar" command.

So my question is: is there a way to put something like "stdin" after the "-input" option? And where should I put that "hadoop jar" command in my Hive query after I select the data?

PS: I found "hive --service jar", and it looks similar to "hadoop jar". Is it helpful here?

Thank you! I just started to learn Hadoop and Hive a couple of weeks ago, so if you have a better design or solution, feel free to let me know. I can rewrite everything.

It seems that you are going to run two rounds of MapReduce: the first is the Hive query and the second is the MapReduce streaming job. As far as I know, to chain multi-round MapReduce jobs, you always need to write to and read from HDFS between the rounds. That is why MapReduce is called a batch operation.

So the answer to your question is no.
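What you can do is stage the reformatted rows in a temporary HDFS directory and delete it as soon as the streaming job finishes, so the extra space is only consumed while the job runs. A rough sketch (the staging path, query, and table name are placeholders for your own):

    #!/usr/bin/env bash
    # Sketch: materialize the Hive SELECT into a temporary HDFS
    # directory, run the streaming job on it, then clean up.
    STAGING=/tmp/vw_input_$$

    # 1. Have Hive write the reformatted rows into the staging directory.
    hive -e "INSERT OVERWRITE DIRECTORY '$STAGING'
             SELECT <your reformatting expressions> FROM <your_table>"

    # 2. Point the streaming job at the staged data; keep the remaining
    #    -D and -file options exactly as in the question.
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
        -input "$STAGING" \
        -output <out_directory> \
        -mapper /home/produser/vowpal_wabbit/cluster/runvw-yarn.sh \
        -reducer NONE

    # 3. Reclaim the space once vw is done.
    hadoop fs -rm -r "$STAGING"

One caveat: INSERT OVERWRITE DIRECTORY writes Hive's default ^A-delimited text files, so make sure the SELECT produces exactly the line format vw expects, for example by concatenating each record into a single string column.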
