Hadoop Streaming Job with binary input?
I wish to convert a binary file in one format to a SequenceFile.
I have a Python script that takes that format on stdin and can output whatever I want.
The input format is not line-based. The individual records are themselves binary, so the output cannot be delimited with \t or broken into lines with \n.
Can I use the Hadoop Streaming interface to consume a binary format? How do I produce a binary output format?
I assume the answer is "No" unless I hear otherwise.
You could have the streaming job use NullWritable as its output and generate the SequenceFile directly inside your Python script. The hadoop-python project on GitHub has candidate code: it is admittedly a bit large and heavy, but it does handle SequenceFile generation.
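If pulling in a whole project feels like too much, here is a minimal sketch of what "generating the SequenceFile inside the script" could look like using only the standard library. It assumes the uncompressed version 6 layout as I understand it (magic bytes, key and value class names, two compression flags, an empty metadata block, a 16-byte sync marker, then length-prefixed records), with NullWritable keys and BytesWritable values. The class name SequenceFileWriter and the 2000-byte sync interval are my own choices for illustration, not part of any library, so check the result against Hadoop's own reader before relying on it.

```python
import os
import struct

SYNC_INTERVAL = 2000  # bytes between sync markers (illustrative choice)


class SequenceFileWriter:
    """Sketch: NullWritable keys, BytesWritable values, no compression.

    The byte layout below is an assumption based on the version 6 format.
    """

    def __init__(self, out):
        self.out = out
        self.sync = os.urandom(16)  # random marker identifying this file
        self.pos = 0
        self._write(b"SEQ\x06")  # magic plus format version
        for name in (b"org.apache.hadoop.io.NullWritable",
                     b"org.apache.hadoop.io.BytesWritable"):
            # Class names are length-prefixed; one byte suffices below 128.
            self._write(bytes([len(name)]) + name)
        self._write(b"\x00\x00")           # no record or block compression
        self._write(struct.pack(">i", 0))  # zero metadata entries
        self._write(self.sync)
        self.last_sync = self.pos

    def _write(self, data):
        self.out.write(data)
        self.pos += len(data)

    def append(self, payload):
        """Append one binary record as a BytesWritable value."""
        if self.pos - self.last_sync >= SYNC_INTERVAL:
            # Sync escape (-1) followed by the marker lets readers resync.
            self._write(struct.pack(">i", -1) + self.sync)
            self.last_sync = self.pos
        value = struct.pack(">i", len(payload)) + payload  # BytesWritable body
        # Record length, then key length (0: NullWritable has no bytes).
        self._write(struct.pack(">ii", len(value), 0) + value)
```

Usage would be roughly: open the target with open(path, "wb"), wrap it in SequenceFileWriter, and call append() once per record parsed from sys.stdin.buffer. Because every record is length-prefixed, tabs and newlines inside the payload are harmless.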