Hadoop seq directory with index, data and bloom files: how to read?
New to Hadoop. I have a series of HDFS directories with the naming convention filename.seq. Each directory contains an index, data and bloom file. These have binary content and appear to be SequenceFiles (the header starts with SEQ). I want to know the structure/schema. Everything I read refers to reading an individual sequence file, so I'm not sure how to read these or how they were produced. Thanks.
Update: I've tried the recommended tools for streaming and outputting text on the files; none worked:
hadoop fs -text /path/to/hdfs-filename.seq/data | head
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
-input /path/to/hdfs-filename.seq/data \
-output /tmp/outputfile \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat
Error was:
ERROR streaming.StreamJob: Job not successful. Error: NA
The SEQ header confirms that it is a Hadoop sequence file. (One thing I have never seen is the bloom file that you mentioned.)
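To go one step past spotting the SEQ magic bytes, the header itself can be decoded to see which key and value classes the file was written with, which is the closest thing a SequenceFile has to a schema. The following is only a minimal sketch, assuming a version 6 header as written by recent Hadoop releases; the function names (`read_vint`, `read_text`, `parse_seq_header`) are made up for this example and are not part of any Hadoop API.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (the WritableUtils encoding)."""
    first = struct.unpack("b", stream.read(1))[0]  # signed byte
    if first >= -112:
        return first  # small values are stored in a single byte
    # Otherwise the first byte encodes the sign and how many bytes follow.
    size = (-119 - first) if first < -120 else (-111 - first)
    value = 0
    for _ in range(size - 1):
        value = (value << 8) | stream.read(1)[0]
    return ~value if first < -120 else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string (vint length, then the bytes)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def parse_seq_header(stream):
    """Parse a SequenceFile header from a binary stream into a dict."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = stream.read(1)[0]
    if version < 6:
        # Assumption of this sketch: older header layouts are not handled.
        raise ValueError("this sketch only handles version 6 headers")
    header = {
        "version": version,
        "key_class": read_text(stream),
        "value_class": read_text(stream),
        "compressed": stream.read(1) != b"\x00",
        "block_compressed": stream.read(1) != b"\x00",
    }
    # The codec class name is only present when compression is on.
    header["codec"] = read_text(stream) if header["compressed"] else None
    # Metadata: a 4-byte big-endian pair count, then key/value strings.
    (pairs,) = struct.unpack(">i", stream.read(4))
    metadata = {}
    for _ in range(pairs):
        key = read_text(stream)
        metadata[key] = read_text(stream)
    header["metadata"] = metadata
    # A 16-byte sync marker closes the header.
    header["sync"] = stream.read(16).hex()
    return header
```

One way to try it, assuming the first few kilobytes have been copied locally with something like `hadoop fs -cat /path/to/hdfs-filename.seq/data | head -c 4096 > header.bin` (the local file name is arbitrary), is `parse_seq_header(open("header.bin", "rb"))`. The directory layout described in the question matches what Hadoop's MapFile writer produces (a `data` and an `index` SequenceFile side by side, with BloomMapFile adding a `bloom` file), so the same check is likely to work on `index` as well. If the reported value class is a custom Writable, that class would need to be on the classpath before tools such as `hadoop fs -text` can decode the records.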
The structure/schema of a typical Sequence file is:
For more details: