Hadoop seq directory with index, data and bloom files: how to read?

New to Hadoop... I have a series of HDFS directories with the naming convention filename.seq. Each directory contains an index, data and bloom file. These have binary content and appear to be SequenceFiles (the header starts with SEQ). I want to know the structure/schema. Everything I have read refers to reading an individual sequence file, so I'm not sure how to read these or how they were produced. Thanks.

Update: I've tried the recommended tools for streaming and outputting text on the files; none worked:

hadoop fs -text /path/to/hdfs-filename.seq/data | head

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
-input /path/to/hdfs-filename.seq/data \
-output /tmp/outputfile \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat

Error was:

ERROR streaming.StreamJob: Job not successful. Error: NA

The SEQ header confirms that these are Hadoop sequence files. (One thing that I have never seen is the bloom file that you mentioned.)

The structure/schema of a typical sequence file is:

  • Header (version, key class, value class, compression, compression codec, metadata)
  • Record
      • Record length
      • Key length
      • Key
      • Value
  • A sync-marker every few 100 bytes or so.
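
The header fields listed above can be inspected by hand, which is one way to find the key and value classes (the closest thing to a schema that the file carries). The following is only a minimal sketch: it assumes the version 6 header layout, assumes both class names are shorter than 128 bytes so that each length prefix is a single byte, and works on a local copy of the file. The function names `byte_at`, `text_at` and `seq_header` are made up for this illustration.

```shell
#!/bin/sh
# Sketch: print the leading header fields of a SequenceFile stored locally.
# Assumptions: version 6 header layout; class names shorter than 128 bytes,
# so each length prefix fits in one byte.

# byte_at FILE OFFSET -> unsigned decimal value of the byte at OFFSET
byte_at() {
    od -An -tu1 -j "$2" -N 1 "$1" | tr -d '[:space:]'
}

# text_at FILE OFFSET LENGTH -> LENGTH raw bytes starting at OFFSET
text_at() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}

seq_header() {
    file=$1
    # The first three bytes must be the magic string "SEQ".
    if [ "$(text_at "$file" 0 3)" != "SEQ" ]; then
        echo "not a SequenceFile" >&2
        return 1
    fi
    # Byte 3 is the format version.
    echo "version: $(byte_at "$file" 3)"

    # Key class name: one length byte followed by the name itself.
    key_len=$(byte_at "$file" 4)
    echo "key class: $(text_at "$file" 5 "$key_len")"

    # Value class name follows immediately, encoded the same way.
    val_off=$((5 + key_len))
    val_len=$(byte_at "$file" "$val_off")
    echo "value class: $(text_at "$file" $((val_off + 1)) "$val_len")"

    # Two one-byte booleans: record compression, then block compression.
    flags_off=$((val_off + 1 + val_len))
    echo "compressed: $(byte_at "$file" "$flags_off")"
    echo "block compressed: $(byte_at "$file" $((flags_off + 1)))"
}
```

For a file on HDFS, the first bytes could be copied locally first, for example with `hadoop fs -cat /path/to/hdfs-filename.seq/data | head -c 512 > /tmp/data.head` (the local path is arbitrary), and then inspected with `seq_header /tmp/data.head`. As a side note, a directory holding `data` and `index` files plus a `bloom` file matches the layout written by Hadoop's `MapFile` and `BloomMapFile` classes, so the directories in the question may well be of that kind; in that case `data` would be the file that holds the actual records.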

For more details:

  1. See the description here.
  2. Sequence file reader and How to read hadoop sequential file?
