
Java - Download sequence file in Hadoop

I have a problem copying binary files (which are stored as sequence files in Hadoop) to my local machine. The problem is that the binary file I downloaded from HDFS was not the original binary file I generated when running my map-reduce tasks. I Googled similar problems, and I guess the issue is that when I copy the sequence files to my local machine, I get the header of the sequence file plus the original file.

My question is: is there any way to avoid downloading the header but still preserve my original binary file?

There are two ways I can think of:

  1. I can transform the binary file into some other format, like Text, so that I can avoid using SequenceFile. After I do copyToLocal, I transform it back into a binary file.

  2. I still use the sequence file, but when I generate the binary file, I also generate some meta information about the corresponding sequence file (e.g. the length of the header and the original length of the file). After I do copyToLocal, I use the downloaded binary file (which contains the header, etc.) along with the meta information to recover my original binary file.

I don't know which one is feasible. Could anyone give me a solution? Could you also show me some sample code for the solution you suggest?

I highly appreciate your help.

I found a workaround for this question. Since downloading a sequence file gives you the header and other magic words in the binary file, the way I avoid this problem is to transform my original binary file into a Base64 string and store it as Text in HDFS; when downloading the encoded binary files, I decode them back into my original binary files.

I know this takes extra time, but currently I haven't found any other solution to this problem. The hard part of directly removing the header and other magic words from the sequence file is that Hadoop may insert sync marks in the middle of my binary file.
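For illustration, here is a minimal sketch of that Base64 round trip using java.util.Base64 (the class name Base64RoundTrip and the one-string-per-file layout are assumptions for the sketch, not code from the original post):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {

    // Read the original binary file and encode it as a Base64 string,
    // which can then be stored as a Text record in HDFS.
    public static String encode(String binaryPath) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(binaryPath));
        return Base64.getEncoder().encodeToString(raw);
    }

    // After copyToLocal, decode the downloaded Base64 text back into
    // the original binary file.
    public static void decode(String encodedPath, String decodedPath) throws IOException {
        String encoded = new String(Files.readAllBytes(Paths.get(encodedPath)), StandardCharsets.UTF_8).trim();
        Files.write(Paths.get(decodedPath), Base64.getDecoder().decode(encoded));
    }
}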

If anyone has a better solution to this problem, I'd be very happy to hear about it. :)

Use MapReduce code to read the sequence file, with SequenceFileInputFormat as the input format for reading it from HDFS. This splits the file into key-value pairs, and the value holds only the binary file contents, which you can use to recreate your binary file.

Here is a code snippet that splits a sequence file made up of multiple images into individual binary files and writes them out under the job's output path.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CreateOrgFilesFromSeqFile {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        if (args.length != 2) {
            System.out.println("Incorrect number of args (" + args.length + "). Expected 2 args: <seqFileInputPath> <outputPath>");
            System.exit(-1);
        }

        Path seqFileInputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CreateOrgFilesFromSeqFile");

        job.setJarByClass(CreateOrgFilesFromSeqFile.class);
        job.setMapperClass(CreateOrgFileFromSeqFileMapper.class);
        // Map-only job: the mapper writes the binary files itself
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, seqFileInputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Delete any existing output directory so the job can be rerun
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : -1);

    }

}

class CreateOrgFileFromSeqFileMapper extends Mapper<Text, BytesWritable, NullWritable, Text>{

    @Override
    public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException{


        Path outputPath = FileOutputFormat.getOutputPath(context);
        FileSystem fs = outputPath.getFileSystem(context.getConfiguration());

        // The key holds the original file path; keep only the file name
        String[] filePathWords = key.toString().split("/");
        String fileName = filePathWords[filePathWords.length - 1];

        System.out.println("outputPath.toString()+ key: " + outputPath.toString() + "/" + fileName + "value length : " + value.getLength());

        try (FSDataOutputStream fdos = fs.create(new Path(outputPath, fileName))) {
            fdos.write(value.getBytes(), 0, value.getLength());
            fdos.flush();
        }

            //System.out.println("value: " + value + ";\t baos.toByteArray().length: " + baos.toByteArray().length);
            context.write(NullWritable.get(), new Text(outputPath.toString() + "/" + fileName));            
    }
}
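Assuming the class above is packaged into a jar (the jar name and paths below are placeholders, not from the original answer), the job can be launched with something like:

hadoop jar seqfile-extract.jar CreateOrgFilesFromSeqFile /user/me/images.seq /user/me/extracted

Note that the mapper writes through whichever FileSystem owns the output path, so with an HDFS output path the extracted files land in HDFS and still need a copyToLocal; pointing the output at a file:/// path should write them directly to the local file system.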
