Hadoop seq directory with index, data and bloom files — how to read?

Question

I'm new to Hadoop. I have a series of HDFS directories named with the convention filename.seq. Each directory contains an index, a data and a bloom file. These have binary content and appear to be SequenceFiles (the header starts with SEQ). I want to know their structure/schema. Everything I have read refers to reading an individual sequence file, so I'm not sure how to read these or how they were produced. Thanks.

Update: I've tried the recommended tools for streaming the files and printing them as text; none worked:

hadoop fs -text /path/to/hdfs-filename.seq/data | head

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
-input /path/to/hdfs-filename.seq/data \
-output /tmp/outputfile \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat

Error was:

ERROR streaming.StreamJob: Job not successful. Error: NA

Answer 1

The SEQ header confirms that these are Hadoop SequenceFiles. (One thing I have never seen is the bloom file that you mentioned.)

The structure / schema of a typical SequenceFile is:

Header (version, key class, value class, compression, compression codec, metadata)
Record
    Record length
    Key length
    Key
    Value
A sync marker every few 100 bytes or so.
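
The layout above can be sketched as a small stdlib-only Python parser. This is an illustration under stated assumptions, not Hadoop's own reader: it assumes a version 6 header, uncompressed records, strings prefixed with a Hadoop variable-length integer, and a 16-byte sync marker announced by a record length of -1. The function names (read_vint, read_text, read_header, read_records) are made up for this example.

```python
import io
import struct

SYNC_ESCAPE = -1      # record-length value that announces a sync marker
SYNC_HASH_SIZE = 16   # length of the sync marker in bytes


def read_vint(stream):
    """Decode one Hadoop variable-length integer (as WritableUtils writes it)."""
    first = struct.unpack("b", stream.read(1))[0]
    if first >= -112:
        return first                       # small values fit in a single byte
    size = (-119 - first) if first < -120 else (-111 - first)
    value = 0
    for byte in stream.read(size - 1):     # remaining bytes, big-endian
        value = (value << 8) | byte
    return ~value if first < -120 else value


def read_text(stream):
    """Read a vint-prefixed UTF-8 string (class names, metadata entries)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_header(stream):
    """Parse the header fields listed above; assumes header version 6."""
    magic = stream.read(4)
    if magic[:3] != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    if magic[3] != 6:
        raise ValueError("this sketch only handles version 6 headers")
    header = {"version": magic[3]}
    header["key_class"] = read_text(stream)
    header["value_class"] = read_text(stream)
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is on.
    header["codec"] = read_text(stream) if header["compressed"] else None
    count = struct.unpack(">i", stream.read(4))[0]
    header["metadata"] = {}
    for _ in range(count):
        name = read_text(stream)
        header["metadata"][name] = read_text(stream)
    header["sync"] = stream.read(SYNC_HASH_SIZE)
    return header


def read_records(stream, header):
    """Yield (key_bytes, value_bytes) pairs from an uncompressed file body."""
    if header["compressed"]:
        raise ValueError("this sketch does not decompress records")
    while True:
        raw = stream.read(4)
        if len(raw) < 4:
            return                         # end of file
        record_length = struct.unpack(">i", raw)[0]
        if record_length == SYNC_ESCAPE:
            if stream.read(SYNC_HASH_SIZE) != header["sync"]:
                raise ValueError("sync marker mismatch")
            continue
        key_length = struct.unpack(">i", stream.read(4))[0]
        key = stream.read(key_length)
        # Record length covers key plus value, so the rest is the value.
        value = stream.read(record_length - key_length)
        yield key, value
```

To try it, copy one file to local disk first (for example with hadoop fs -get), open it in binary mode, call read_header, and print the key_class and value_class entries; those two class names are the "schema" of the file. The keys and values come back as raw serialized bytes, so decoding them depends on those classes. As a side note, a directory holding data and index files is what Hadoop's MapFile writes, and BloomMapFile adds a bloom file; whether the directories in the question were produced that way is an assumption.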

For more details:

See the SequenceFile description, and the related questions "Sequence file reader" and "How to read hadoop sequential file?"

solution1
1 ACCEPTED 2013-05-27 21:49:17
