What is the maximum input file size (without split) for a Mapper in Hadoop MapReduce?
I have written a MapReduce job that takes some Protobuf files as input. Owing to the nature of the files (unsplittable), each file is processed by one mapper (I implemented a custom `FileInputFormat` with `isSplitable` set to false). The application works well with input files smaller than ~680MB and produces the expected result files; however, once the input file size crosses that limit, the application completes successfully but produces an empty file.

I'm wondering if I'm hitting some limit on file size for a Mapper? If it matters, the files are stored on Google Storage (GFS) and not HDFS.

Thanks!
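For context, a minimal sketch of the kind of unsplittable input format described above might look like the following (the class names `WholeFileInputFormat` and `WholeFileRecordReader` are illustrative, not the asker's actual code; this assumes the Hadoop MapReduce client libraries are on the classpath):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Each Protobuf file is read whole: isSplitable == false means
// one split per file, and therefore one mapper per file.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so a single RecordReader sees the entire file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Hypothetical reader that loads the full file into one BytesWritable,
        // along the lines of the nextKeyValue() shown in the answer.
        return new WholeFileRecordReader();
    }
}
```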
Turns out I had hit a well-known Hadoop bug discussed here. The issue was the `BytesWritable` class, which was used to write the Protobuf files. In the custom `RecordReader` I previously did:
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
        // Read the whole (unsplittable) file into memory in one shot.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        log.debug("Path file:" + file);
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } catch (Exception e) {
            // Note: swallowing the exception here means a failed set()
            // leaves `value` empty while the job still reports success.
            log.error(e);
        } finally {
            IOUtils.closeQuietly(in);
        }
        processed = true;
        return true;
    }
    return false;
}
Because of the bug, the maximum content size is effectively capped at `Integer.MAX_VALUE / 3`, which is ~680MB. To get around this, I had to manually call `value.setCapacity(my_ideal_max_size)` before calling `value.set()`.
Hope this helps somebody else!