简体   繁体   English

具有二进制输入的Hadoop流作业?

[英]Hadoop Streaming Job with binary input?

I wish to convert a binary file in one format to a SequenceFile. 我希望将一种格式的二进制文件转换为SequenceFile。

I have a Python script that takes that format on stdin and can output whatever I want. 我有一个在stdin上采用该格式的Python脚本,可以输出我想要的任何内容。

The input format is not line-based. 输入格式不是基于行的。 The individual records are binary themselves, hence the output format cannot be \\t delimited or broken into lines with \\n. 各个记录本身都是二进制的,因此输出格式不能用\\ t分隔或用\\ n分成几行。

Can I use the Hadoop Streaming interface to consume a binary format? 我可以使用Hadoop Streaming接口使用二进制格式吗? How do I produce a binary output format? 如何产生二进制输出格式?

I assume the answer is "No" unless I hear otherwise. 除非另有说明,否则我认为答案是“否”。

You may consider using NullWritable as output, and generating the SequenceFile directly inside of your python script. 您可以考虑使用NullWritable作为输出,并直接在python脚本内部生成SequenceFile。 You can look up the hadoop-python project in github to see candidate code: though it is admittedly bit large-ish/heavy it does handle the sequencefile generation. 您可以在github中查找hadoop-python项目,以查看候选代码:尽管它虽然有点笨重,但确实可以处理sequencefile的生成。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM