Hadoop Streaming Job with binary input?

Original post 2013-02-21 21:02:29 8 1 python/ hadoop/ hadoop-streaming

I wish to convert a binary file in one format to a SequenceFile.

I have a Python script that takes that format on stdin and can output whatever I want.

The input format is not line-based. The individual records are themselves binary, so the output cannot be \t-delimited or broken into lines with \n.
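To make the problem concrete, here is a minimal sketch of reading such records in Python. The actual format is not described here, so the sketch assumes a hypothetical framing in which each record is a 4-byte big-endian length followed by that many payload bytes; `read_records` is an illustrative name, not part of any library.

```python
import io
import struct


def read_records(stream):
    """Yield raw payloads from a length-prefixed binary stream.

    Assumed (hypothetical) framing: 4-byte big-endian length, then payload.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of input
        if len(header) < 4:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record payload")
        yield payload


# Demo on an in-memory stream. Payloads may contain tabs and newlines,
# which is why a line-oriented, tab-delimited protocol cannot carry them.
demo = struct.pack(">I", 3) + b"a\tb" + struct.pack(">I", 2) + b"\n\x00"
records = list(read_records(io.BytesIO(demo)))
```

In a real script the stream would be `sys.stdin.buffer`, which exposes stdin as raw bytes rather than decoded text.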

Can I use the Hadoop Streaming interface to consume a binary format? How do I produce a binary output format?

I assume the answer is "No" unless I hear otherwise.

1 Answer

You may consider using NullWritable as the output and generating the SequenceFile directly inside your Python script. You can look up the hadoop-python project on GitHub for candidate code: it is admittedly a bit large and heavy, but it does handle SequenceFile generation.
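As a rough sketch of what generating the SequenceFile directly could look like, the code below writes an uncompressed file with NullWritable keys and BytesWritable values. It is not the hadoop-python project's code. It follows my understanding of the version 6 on-disk layout, the key/value class choice is an assumption, and the function names are made up for illustration. Check the output with a real Hadoop reader (for example `hadoop fs -text`) before relying on it.

```python
import io
import os
import struct

KEY_CLASS = b"org.apache.hadoop.io.NullWritable"
VALUE_CLASS = b"org.apache.hadoop.io.BytesWritable"


def write_text(out, name):
    # Class names are stored as a variable-length int followed by UTF-8 bytes.
    # For lengths up to 127 the vint is a single byte, which covers both names.
    assert len(name) <= 127
    out.write(struct.pack(">b", len(name)) + name)


def write_header(out, sync):
    out.write(b"SEQ\x06")            # magic bytes plus format version 6
    write_text(out, KEY_CLASS)
    write_text(out, VALUE_CLASS)
    out.write(b"\x00\x00")           # not compressed, not block-compressed
    out.write(struct.pack(">i", 0))  # zero metadata entries
    out.write(sync)                  # 16-byte sync marker


def append_record(out, payload):
    # NullWritable serializes to zero bytes; BytesWritable is a 4-byte
    # big-endian length followed by the raw bytes.
    value = struct.pack(">i", len(payload)) + payload
    out.write(struct.pack(">ii", len(value), 0))  # record length, key length
    out.write(value)


# Demo: one record whose payload contains a tab and a newline.
buf = io.BytesIO()
write_header(buf, os.urandom(16))
append_record(buf, b"\x00\t\n")
```

The sketch omits the periodic sync markers that Hadoop's own writer inserts between records, so I would expect the resulting file to be readable but not to split well across mappers.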

