Hadoop Mapper: Appropriate input files size?

My cluster's HDFS block size is 64 MB. I have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for the job is TextInputFormat. How many Mappers will run?

I saw this question in a Hadoop Developer exam. The given answer is 100; the other three answer options were 64, 640, and 200. But I am not sure how 100 follows, or whether the answer is wrong.

Please guide. Thanks in advance.

I would agree with your assessment that this appears wrong.

Unless, of course, there is more to the exam question than was posted:

  • Are these 'plain' text files gzip-compressed? In that case they are not splittable.
  • The cluster default block size may be 64 MB, but what block size was assigned to the input files themselves - 128 MB?

To be fair to the exam question and its 'correct' answer, we would need the question in its entirety.

The correct answer should be 200 (if the file block sizes are all the default 64 MB, and the files are either not compressed, or compressed with a splittable codec such as Snappy).
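As a quick sanity check of that arithmetic, here is a minimal plain-Java sketch (the class name is just for illustration). One caveat: FileInputFormat only folds a trailing remainder into the previous split when it is under 10% of the split size, and the 36 MB remainder here is well above that threshold, so each 100 MB file really does produce two splits:

    // Back-of-the-envelope check: split size defaults to the 64 MB block size.
    public class SplitCount {
        public static void main(String[] args) {
            long splitSize = 64L * 1024 * 1024;  // effective split size (= block size)
            long fileSize = 100L * 1024 * 1024;  // size of each input file
            int numFiles = 100;

            // Ceiling division: splits needed to cover one file (64 MB + 36 MB = 2)
            long splitsPerFile = (fileSize + splitSize - 1) / splitSize;

            System.out.println(splitsPerFile * numFiles);  // prints 200
        }
    }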

The answer looks wrong to me, too.

But it may be correct in the scenarios below:

1) If we override the isSplitable method to return false, then the number of map tasks will be the same as the number of input files. In this case it will be 100.
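A minimal sketch of this scenario (the class name is hypothetical; isSplitable is the real protected hook exposed by TextInputFormat in the new MapReduce API):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Never split input files: each file becomes exactly one input split,
    // so the job runs one map task per file (100 in this question).
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

Wire it in with job.setInputFormatClass(WholeFileTextInputFormat.class). Note that this costs data locality for files larger than one block.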

2) If we configure the mapred.min.split.size and mapred.max.split.size variables. By default, the minimum split size is 0 and the maximum split size is Long.MAX_VALUE.

Below is the formula Hadoop uses to determine the split size (and hence the number of mappers):

max(mapred.min.split.size, min(mapred.max.split.size, blocksize))

In this scenario, if we configure mapred.min.split.size as 100 MB, then we will have 100 mappers.
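A minimal sketch of this configuration, assuming the new MapReduce API (where the property is called mapreduce.input.fileinputformat.split.minsize; mapred.min.split.size is the older name for the same setting):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MinSplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "min-split-size-demo");

            // Raise the minimum split size to 100 MB. The effective split size
            // becomes max(100 MB, min(Long.MAX_VALUE, 64 MB)) = 100 MB, so each
            // 100 MB file fits in a single split -> 100 mappers.
            FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024);
        }
    }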

But according to the given information, I think 100 is not the right answer.

When the block size (64 MB) is smaller than the file size (100 MB), each file will be split in two, so 200 mappers will run.
