
How to tell MapReduce how many mappers to use?

I am trying to optimize the speed of a MapReduce job.

Is there any way I can tell Hadoop to use a particular number of mapper/reducer processes? Or, at least, a minimal number of mapper processes?

The documentation specifies that you can do this with the method

public void setNumMapTasks(int n)

of the JobConf class.

That way is obsolete, though, so I am starting the job with the Job class. What is the right way of doing this?
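For reference, this is roughly how I start it (a minimal sketch; the class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my job"); // new API: Job has no setNumMapTasks()
        job.setJarByClass(MyJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}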

The number of map tasks is determined by the number of blocks in the input. If the input file is 100MB and the HDFS block size is 64MB, the input file will take 2 blocks, so 2 map tasks will be spawned. JobConf.setNumMapTasks() (1) is only a hint to the framework.
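To illustrate, a sketch with the old mapred API (the values are arbitrary, and this assumes the hint semantics described above):

import org.apache.hadoop.mapred.JobConf;

public class OldApiHint {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Only a hint: the framework still derives the actual number of
        // map tasks from the input splits (roughly one per HDFS block).
        conf.setNumMapTasks(10);
        // Raising the minimum split size, by contrast, reliably reduces
        // the number of mappers (bytes; 128MB is an illustrative value).
        conf.setLong("mapred.min.split.size", 128 * 1024 * 1024);
    }
}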

The number of reducers is set by the JobConf.setNumReduceTasks() function. This determines the total number of reduce tasks for the job. Also, the mapred.tasktracker.reduce.tasks.maximum parameter determines the number of reduce tasks that can run in parallel on a single task tracker node.
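For example, with the new API the reducer count is honored exactly, unlike the mapper hint (4 is an arbitrary value, and note that mapred.tasktracker.reduce.tasks.maximum is a per-node setting in each task tracker's configuration, not something set per job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "my job");
        // Exact, not a hint: the job will run with exactly 4 reduce tasks.
        job.setNumReduceTasks(4);
    }
}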

You can find more information on the number of map and reduce tasks at (2).

(1) - http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks%28int%29
(2) - http://wiki.apache.org/hadoop/HowManyMapsAndReduces
