
What is the default size that each Hadoop mapper will read?

Is it the HDFS block size of 64 MB? Is there any configuration parameter that I can use to change it?

For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?

This is dependent on your:

  • Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extending FileInputFormat will use the block boundaries as guides.
  • File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, the default block size (at the time of upload) is applied. Any change to the default / system block size after the file is uploaded has no effect on already-uploaded files.
  • The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if they are overridden in your system configuration or in your job, this changes the amount of data processed by each mapper and the number of mapper tasks spawned (see the sketch after this list).
  • Non-splittable compression - such as gzip - cannot be processed by more than a single mapper, so you'll get one mapper per gzip file (unless you're using something like CombineFileInputFormat or CompositeInputFormat).
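To make the interaction between those two properties and the block size concrete, here is a simplified sketch of the split-size calculation that FileInputFormat performs (the computeSplitSize logic mirrors Hadoop's; the surrounding class and the example sizes are just for illustration):

    // Simplified sketch of FileInputFormat's split-size calculation.
    public class SplitSizeSketch {

        // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize):
        // the split size is the block size, clamped to [minSize, maxSize].
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024; // a 64 MB HDFS block

            // Defaults (min = 1, max = Long.MAX_VALUE): split size == block size,
            // so you get one mapper per block.
            System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

            // min.split.size raised above the block size: fewer, larger splits.
            System.out.println(computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE));

            // max.split.size lowered below the block size: more, smaller splits.
            System.out.println(computeSplitSize(blockSize, 1L, 32L * 1024 * 1024));
        }
    }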

So if you have a file with a block size of 64 MB, but want each map task to process more or less data than that, you should be able to set the following job configuration properties (a short example follows the list):

  • mapred.min.split.size - larger than the default if you want to use fewer mappers, at the expense of (potentially) losing data locality (all the data processed by a single map task may now be on two or more data nodes)
  • mapred.max.split.size - smaller than the default if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
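As a minimal sketch of setting these with the old (MR1) API - the property names come from the answer above, while the class name and the 128 MB / 32 MB values are arbitrary examples:

    import org.apache.hadoop.mapred.JobConf;

    public class SplitConfigExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SplitConfigExample.class);

            // Fewer mappers: force each split to be at least 128 MB, even though
            // the blocks are 64 MB (at the possible cost of data locality).
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

            // Or, for more mappers, cap each split at 32 MB instead, so each
            // 64 MB block is processed by two map tasks:
            // conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);
        }
    }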

If you're using MR2 / YARN then the above properties are deprecated and replaced by the following (an equivalent example appears after the list):

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
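With the new (MR2) API you can set the same limits through FileInputFormat's helper methods, which write the two properties above into the job configuration - again a minimal sketch, with arbitrary example sizes:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class Mr2SplitConfigExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();

            // These helpers set mapreduce.input.fileinputformat.split.minsize
            // and mapreduce.input.fileinputformat.split.maxsize in the job config.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // fewer mappers
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // cap split size
        }
    }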
