With Hadoop, how do I change the number of mappers for a given job?

Question

I have two jobs, Job A and Job B. For Job A, I would like a maximum of 6 mappers per node. Job B is a little different: it can only run one mapper per node. The reason isn't important; let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?

The only solution I can think of is:

1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.

2) Before I run Job A:

2a) Shut down my tasktrackers

2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there

2c) Restart my tasktrackers

2d) Wait for the tasktrackers to finish starting

3) Run Job A

and then do a similar thing when I need to run Job B.
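For illustration, step 2b could be sketched roughly as below. This is only a minimal sketch: the class name ConfSwitcher and the method activate are made up for this example, the folder layout is the one described in step 1, and stopping and restarting the tasktrackers (steps 2a, 2c and 2d) is left out entirely.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Hadoop.
public class ConfSwitcher {

    /**
     * Copies <hadoopHome>/conf.<jobName>/mapred-site.xml over
     * <hadoopHome>/conf/mapred-site.xml and returns the overwritten path.
     * The tasktrackers must be stopped before and restarted after this call.
     */
    public static Path activate(Path hadoopHome, String jobName) throws IOException {
        Path source = hadoopHome.resolve("conf." + jobName).resolve("mapred-site.xml");
        Path target = hadoopHome.resolve("conf").resolve("mapred-site.xml");
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ConfSwitcher <hadoop home folder> <JobA|JobB>
        Path active = activate(Paths.get(args[0]), args[1]);
        System.out.println("Active config: " + active);
    }
}
```

Running it with the Hadoop folder and JobB as arguments would put the 1-mapper configuration in place before the tasktrackers are restarted.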

I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?

Answer 1

In the Java code for the custom jar itself, you can set the configuration property mapred.tasktracker.map.tasks.maximum for each of your jobs.

Do something like this:

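import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Body of run(String[] args) in a driver class (here MyMapRed) that extends Configured and implements Tool.
Configuration conf = getConf();

// maximum number of map tasks a tasktracker runs at the same time
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

Job job = new Job(conf);

job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);

job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MapJob.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(ReduceJob.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.setInputPaths(job, args[0]);
// the job will not start without an output directory; args[1] is assumed here
FileOutputFormat.setOutputPath(job, new Path(args[1]));

boolean success = job.waitForCompletion(true);
return success ? 0 : 1;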

EDIT:

You also need to set the property mapred.map.tasks to the value given by the following formula: (mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes in your cluster).
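To make the formula concrete, here is a small worked sketch in plain Java. The class name MapTaskCount, the method totalMapTasks and the cluster size of 10 nodes are assumptions made up for this example; only the per-node limits of 6 and 1 come from the question. Keep in mind that Hadoop treats mapred.map.tasks as a hint, and the input splits still decide how many map tasks actually run.

```java
// Hypothetical helper, not part of Hadoop.
public class MapTaskCount {

    /** Applies the formula: per-node maximum * number of tasktracker nodes. */
    public static int totalMapTasks(int maxMapsPerNode, int taskTrackerNodes) {
        if (maxMapsPerNode < 1 || taskTrackerNodes < 1) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return Math.multiplyExact(maxMapsPerNode, taskTrackerNodes);
    }

    public static void main(String[] args) {
        int nodes = 10; // assumed cluster size, not taken from the question
        System.out.println("Job A: " + totalMapTasks(6, nodes)); // prints "Job A: 60"
        System.out.println("Job B: " + totalMapTasks(1, nodes)); // prints "Job B: 10"
        // In the driver, the result would be passed on with something like:
        // conf.setInt("mapred.map.tasks", totalMapTasks(6, nodes));
    }
}
```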

Answer 1
0 2013-03-12 07:22:33
