What is the maximum input file size (without split) for a Mapper in Hadoop MapReduce?

I have written a MapReduce job that works on some Protobuf files as input. Owing to the nature of the files (unsplittable), each file is processed by one mapper (I implemented a custom FileInputFormat with isSplitable set to false). The application works well with input files smaller than ~680MB and produces the resulting files; however, once the input file size crosses that limit, the application completes successfully but produces an empty file.
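For context, here is a minimal sketch of the kind of non-splittable FileInputFormat I mean (using the new org.apache.hadoop.mapreduce API; the class names WholeFileInputFormat and WholeFileRecordReader are illustrative, not the exact ones in my job, and the record reader is the custom one that loads a whole file into a single BytesWritable record):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each Protobuf file must go to exactly one mapper.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Custom reader that reads the whole file as one BytesWritable record.
        return new WholeFileRecordReader();
    }
}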

I'm wondering if I'm hitting some file-size limit for a Mapper? If it matters, the files are stored on Google Storage (GFS) and not HDFS.

Thanks!

Turns out I had hit a well-known Hadoop bug discussed here. The issue was the BytesWritable class, which was used to hold the Protobuf file contents. In my custom RecordReader I previously did

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
        // Read the entire (unsplittable) file into memory as a single record.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        log.debug("Path file:" + file);
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            // Hand the raw bytes to the BytesWritable value for the mapper.
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }
        processed = true;
        return true;
    }
    return false;
}

By default, the bug caps the maximum content size at Integer.MAX_VALUE/3, which is ~680MB. To get around this, I had to manually call setCapacity() with my desired maximum size by doing

value.setCapacity(my_ideal_max_size)

before I did value.set().
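Putting it together, the middle of my nextKeyValue() now looks roughly like this (here I pass the actual file length as the capacity, which serves the same purpose as a hand-picked my_ideal_max_size as long as it is at least as large as your biggest input file):

        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            // Pre-size the backing buffer so BytesWritable never has to grow
            // it itself and run into the capacity bug described above.
            value.setCapacity(contents.length);
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }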

Hope this helps somebody else!
