
Hadoop seq directory with index, data and bloom files — how to read?

I'm new to Hadoop. I have a series of HDFS directories with the naming convention filename.seq. Each directory contains an index, a data and a bloom file. These have binary content and appear to be SequenceFiles (the header starts with SEQ). I want to know their structure/schema. Everything I've read refers to reading an individual sequence file, so I'm not sure how to read these or how they were produced. Thanks.

Update: I've tried the recommended tools for streaming the files and outputting them as text; none worked:

hadoop fs -text /path/to/hdfs-filename.seq/data | head

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
-input /path/to/hdfs-filename.seq/data \
-output /tmp/outputfile \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat

Error was:

ERROR streaming.StreamJob: Job not successful. Error: NA

The SEQ header confirms that these are Hadoop sequence files. (One thing I have never seen is the bloom file that you mentioned.)

The structure/schema of a typical sequence file is:

  • Header (version, key class, value class, compression, compression codec, metadata)
  • Record
      • Record length
      • Key length
      • Key
      • Value
  • A sync marker every few hundred bytes or so.
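Since the key and value class names are stored in the header, reading the header is often enough to learn the schema. The following is a minimal sketch, not a replacement for Hadoop's own SequenceFile.Reader, of how the header fields listed above could be decoded by hand. It assumes a version 6 file (the current on-disk version) and the usual Hadoop encoding of strings as a variable-length integer followed by UTF-8 bytes. The function names `read_vint`, `read_text` and `parse_seq_header` are made up for this illustration.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer from a binary stream."""
    first = struct.unpack("b", stream.read(1))[0]  # first byte, signed
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    # The first byte encodes how many big-endian payload bytes follow.
    extra = (-119 - first if negative else -111 - first) - 1
    value = 0
    for byte in stream.read(extra):
        value = (value << 8) | byte
    return ~value if negative else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string (vint length, then bytes)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def parse_seq_header(stream):
    """Parse a version 6 SequenceFile header into a dict (illustrative only)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = stream.read(1)[0]
    if version != 6:
        raise ValueError("this sketch only handles version 6, got %d" % version)
    header = {"version": version}
    header["key_class"] = read_text(stream)
    header["value_class"] = read_text(stream)
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is on.
    header["codec"] = read_text(stream) if header["compressed"] else None
    # Metadata: 4-byte big-endian count, then key/value string pairs.
    count = struct.unpack(">i", stream.read(4))[0]
    metadata = {}
    for _ in range(count):
        key = read_text(stream)
        metadata[key] = read_text(stream)
    header["metadata"] = metadata
    header["sync"] = stream.read(16)  # 16-byte sync marker for this file
    return header


if __name__ == "__main__":
    # Synthetic header built in memory, so the sketch runs without a cluster.
    key_name = b"org.apache.hadoop.io.Text"
    raw = (b"SEQ\x06" + bytes([len(key_name)]) + key_name
           + bytes([len(key_name)]) + key_name
           + b"\x00\x00" + struct.pack(">i", 0) + bytes(16))
    print(parse_seq_header(io.BytesIO(raw)))
```

To try it on a real file, the first few kilobytes of the data file can be copied locally (for example by piping `hadoop fs -cat /path/to/hdfs-filename.seq/data` through `head -c 4096` into a local file) and opened in binary mode. As a side note, a directory holding `data` and `index` files is how Hadoop's MapFile is laid out, and BloomMapFile adds a `bloom` file, so the directories in the question were probably written by one of those classes; the `data` file inside is still an ordinary sequence file.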

For more details:

  1. See the description here.
  2. See the related questions "Sequence file reader" and "How to read hadoop sequential file?"
