What is the maximum input file size (without split) for a Mapper in Hadoop MapReduce?

I have written a MapReduce job that works on some Protobuf files as input. Owing to the nature of the files (unsplittable), each file is processed by one mapper (I implemented a custom FileInputFormat with isSplitable set to false). The application works well with input files smaller than ~680MB and produces the resulting files; however, once the input file size crosses that limit, the application completes successfully but produces an empty file.
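For context, here is a minimal sketch of the kind of non-splittable FileInputFormat I mean (using the new org.apache.hadoop.mapreduce API; the class names WholeFileInputFormat and WholeFileRecordReader are illustrative, not the exact ones in my job, and the record reader is the custom one that loads a whole file into a single BytesWritable record):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each Protobuf file must go to exactly one mapper.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Custom reader that reads the whole file as one BytesWritable record.
        return new WholeFileRecordReader();
    }
}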

I'm wondering if I'm hitting some file-size limit for a Mapper? If it matters, the files are stored on Google Storage (GFS) and not HDFS.

Thanks!

Turns out I had hit a well-known Hadoop bug discussed here. The issue was the BytesWritable class, which was used to hold the Protobuf file contents. In my custom RecordReader I previously did

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
        // Read the entire (unsplittable) file into memory as a single record.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        log.debug("Path file:" + file);
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            // Hand the raw bytes to the BytesWritable value for the mapper.
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }
        processed = true;
        return true;
    }
    return false;
}

By default, the bug caps the maximum content size at Integer.MAX_VALUE/3, which is ~680MB. To get around this, I had to manually call setCapacity() with my desired maximum size by doing

value.setCapacity(my_ideal_max_size)

before I did value.set().
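Putting it together, the middle of my nextKeyValue() now looks roughly like this (here I pass the actual file length as the capacity, which serves the same purpose as a hand-picked my_ideal_max_size as long as it is at least as large as your biggest input file):

        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            // Pre-size the backing buffer so BytesWritable never has to grow
            // it itself and run into the capacity bug described above.
            value.setCapacity(contents.length);
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }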

Hope this helps somebody else!
