What is the default size that each Hadoop mapper will read?
Is it the HDFS block size of 64 MB? Is there any configuration parameter that I can use to change it?
For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?
This depends on:
Input format: some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extended from FileInputFormat will use the block boundaries as guides.
FileInputFormat configuration properties: mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if either is overridden in your system configuration or in your job, this changes the amount of data processed by each mapper and the number of mapper tasks spawned.
Things like CombineFileInputFormat and CompositeInputFormat.
So if you have a file with a block size of 64 MB, but want to process more or less than this per map task, you should be able to set the following job configuration properties:
mapred.min.split.size - set this larger than the default if you want to use fewer mappers, at the expense of (potentially) losing data locality (the data processed by a single map task may now be on 2 or more data nodes).
mapred.max.split.size - set this smaller than the default if you want to use more mappers to process each file (say you have a CPU-intensive mapper).
If you're using MR2 / YARN, the above properties are deprecated and replaced by:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
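To make the effect of these two properties concrete, here is a minimal sketch of the arithmetic. It assumes the rule used by the new-API FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)). The class name SplitSizeSketch, the helper names, and the 1 GB file are hypothetical and exist only for illustration. The split count is an approximation for a single splittable file and ignores the small slop Hadoop allows on the last split.

```java
// Illustrative sketch only, not Hadoop code: it mirrors the rule
// splitSize = max(minSize, min(maxSize, blockSize)).
final class SplitSizeSketch {

    // minSize corresponds to mapreduce.input.fileinputformat.split.minsize,
    // maxSize to mapreduce.input.fileinputformat.split.maxsize.
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (and so map tasks) for one splittable file.
    static long approxSplits(long fileLength, long splitSize) {
        return fileLength / splitSize + (fileLength % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;     // block size from the question
        long fileLength = 1024 * mb;  // hypothetical 1 GB input file

        // Defaults (1 and Long.MAX_VALUE): the split follows the block size.
        long byDefault = computeSplitSize(1L, Long.MAX_VALUE, blockSize);
        // Raised minimum: larger splits, fewer mappers.
        long fewer = computeSplitSize(128 * mb, Long.MAX_VALUE, blockSize);
        // Lowered maximum: smaller splits, more mappers.
        long more = computeSplitSize(1L, 32 * mb, blockSize);

        System.out.println("default: " + approxSplits(fileLength, byDefault) + " splits");
        System.out.println("min=128MB: " + approxSplits(fileLength, fewer) + " splits");
        System.out.println("max=32MB: " + approxSplits(fileLength, more) + " splits");
    }
}
```

With the defaults left alone, the block size wins, which is why anything extended from FileInputFormat tends to produce one split per block.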