
Java - Download sequence file in Hadoop

I have a problem copying binary files (stored as sequence files in Hadoop) to my local machine. The binary file I downloaded from HDFS is not the original binary file I generated when running my map-reduce tasks. I googled similar problems, and I guess the issue is that when I copy the sequence files to my local machine, I get the header of the sequence file plus the original file.

My question is: is there any way to avoid downloading the header but still preserve my original binary file?

There are two ways I can think of:

  1. I can transform the binary file into some other format, like Text, so that I can avoid using SequenceFile. After I do copyToLocal, I transform it back into a binary file.

  2. I still use the sequence file, but when I generate the binary file I also generate some meta information about the corresponding sequence file (e.g. the length of the header and the original length of the file). After I do copyToLocal, I use the downloaded binary file (which contains the header, etc.) together with the meta information to recover my original binary file.

I don't know which one is feasible. Could anyone give me a solution? Could you also show me some sample code for the solution you suggest?

I highly appreciate your help.

I found a workaround for this question. Since downloading a sequence file gives you the header and other magic words inside the binary file, the way I avoid this problem is to transform my original binary file into a Base64 string, store it as Text in HDFS, and, when downloading the encoded binary files, decode them back into my original binary file.

I know this takes extra time, but currently I haven't found any other solution to this problem. The hard part about directly stripping the header and other magic words from the sequence file is that Hadoop may insert sync markers into the middle of my binary file.
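For reference, below is a minimal sketch of that Base64 round trip in plain Java (the class, method, and file names are placeholders, not code from my actual job): encode the binary file before storing it as Text in HDFS, and decode the downloaded text back into the original binary file.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {

    // Encode a local binary file into a Base64 string, suitable for
    // storing as a Text value in HDFS
    public static String encode(Path binaryFile) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(binaryFile));
    }

    // Decode the Base64 text downloaded via copyToLocal back into
    // the original binary file
    public static void decode(Path textFile, Path binaryOut) throws IOException {
        String encoded = new String(Files.readAllBytes(textFile)).trim();
        Files.write(binaryOut, Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) throws IOException {
        // Round trip on example paths: original.bin -> encoded.txt -> restored.bin
        Files.write(Paths.get("encoded.txt"), encode(Paths.get("original.bin")).getBytes());
        decode(Paths.get("encoded.txt"), Paths.get("restored.bin"));
    }
}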

If anyone has a better solution to this problem, I'd be very happy to hear about it. :)

Use MapReduce code to read the SequenceFile, with SequenceFileInputFormat as the input format for reading the sequence file in HDFS. This splits the file into key-value pairs, and each value holds only the binary contents of one file, which you can use to recreate that binary file.

Here is a code snippet that splits a sequence file made up of multiple images into individual binary files and writes them out under the job's output path, from where they can be copied to the local file system.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CreateOrgFilesFromSeqFile {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        if (args.length != 2) {
            System.out.println("Incorrect no. of args (" + args.length + "). Expected 2 args: <seqFileInputPath> <outputPath>");
            System.exit(-1);
        }

        Path seqFileInputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CreateOrgFilesFromSeqFile");

        job.setJarByClass(CreateOrgFilesFromSeqFile.class);
        job.setMapperClass(CreateOrgFileFromSeqFileMapper.class);

        // Read the input as a sequence file so the mapper receives
        // (key, value) records rather than raw bytes
        job.setInputFormatClass(SequenceFileInputFormat.class);

        // Map-only job: the mapper writes the binary files itself
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, seqFileInputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Delete the output directory if it already exists
        outputPath.getFileSystem(conf).delete(outputPath, true);

        System.exit(job.waitForCompletion(true) ? 0 : -1);

    }

}

class CreateOrgFileFromSeqFileMapper extends Mapper<Text, BytesWritable, NullWritable, Text>{

    @Override
    public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {

        // The key is assumed to hold the original file path and the
        // value the raw bytes of that file
        Path outputPath = FileOutputFormat.getOutputPath(context);
        FileSystem fs = outputPath.getFileSystem(context.getConfiguration());

        // Use the last path component of the key as the output file name
        String[] filePathWords = key.toString().split("/");
        String fileName = filePathWords[filePathWords.length - 1];

        System.out.println("Writing " + outputPath.toString() + "/" + fileName + ", value length: " + value.getLength());

        // getBytes() may return a padded buffer, so write only getLength() bytes
        try (FSDataOutputStream fdos = fs.create(new Path(outputPath.toString() + "/" + fileName))) {
            fdos.write(value.getBytes(), 0, value.getLength());
            fdos.flush();
        }

        // Emit the path of the file that was just written
        context.write(NullWritable.get(), new Text(outputPath.toString() + "/" + fileName));
    }
}
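To use this, package the class into a jar, submit it with hadoop jar, and then copy the reconstructed files down with hdfs dfs -copyToLocal. The jar name and paths here are only examples:

hadoop jar extract-seq.jar CreateOrgFilesFromSeqFile /user/me/images.seq /user/me/extracted
hdfs dfs -copyToLocal /user/me/extracted ./extracted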
