With Hadoop, how to change the number of mappers for a given job?

So, I have two jobs, Job A and Job B. For Job A, I would like to have a maximum of 6 mappers per node. However, Job B is a little different: I can only run one mapper per node. The reason for this isn't important; let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?

The only solution I can think of is:

1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.

2) Before I run Job A:

2a) Shut down my tasktrackers

2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there

2c) Restart my tasktrackers

2d) Wait for the tasktrackers to finish starting

3) Run Job A

and then do the same thing with conf.JobB when I need to run Job B.
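To make step 2b concrete, here is a minimal sketch of the file swap in plain Java. The class name MapredSiteSwitcher and the method activate are hypothetical, and the folder layout (conf.JobA, conf.JobB and conf under one Hadoop home directory) is taken from the steps above. Stopping and restarting the tasktrackers (steps 2a, 2c and 2d) is not covered and would still have to happen around this call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for step 2b: activates the per-job mapred-site.xml.
public class MapredSiteSwitcher {

    // Copies <hadoopHome>/conf.<jobName>/mapred-site.xml over
    // <hadoopHome>/conf/mapred-site.xml and returns the path of the active copy.
    static Path activate(Path hadoopHome, String jobName) throws IOException {
        Path source = hadoopHome.resolve("conf." + jobName).resolve("mapred-site.xml");
        Path target = hadoopHome.resolve("conf").resolve("mapred-site.xml");
        Files.createDirectories(target.getParent());
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: MapredSiteSwitcher <hadoop-home> <JobA|JobB>");
            return;
        }
        // The tasktrackers must be stopped before and restarted after this call.
        Path active = activate(Paths.get(args[0]), args[1]);
        System.out.println("Activated " + active);
    }
}
```

This only removes the manual copy; the restart cycle, which is the fragile part of the procedure, remains.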

I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?

In the Java code of your custom jar itself, you could set the configuration property mapred.tasktracker.map.tasks.maximum separately for each of your jobs.

Do something like this:

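import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Inside run(String[] args) of a driver class that extends Configured and implements Tool:
Configuration conf = getConf();

// Maximum number of map tasks per tasktracker for this job (e.g. 6 for Job A, 1 for Job B)
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

Job job = new Job(conf);

job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);

job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MapJob.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(ReduceJob.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.setInputPaths(job, args[0]);
// Assumption: args[1] is the output directory. The original snippet set no
// output path, which TextOutputFormat requires.
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// Block until the job finishes, printing progress
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;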

EDIT:

You also need to set the property mapred.map.tasks to the value given by this formula: mapred.tasktracker.map.tasks.maximum * (number of tasktracker nodes in your cluster).
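As a worked example of that formula, the sketch below computes the value for Job A's limit of 6 mappers per node. The cluster size of 10 tasktracker nodes is an assumption made purely for illustration, and the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the formula from the EDIT above.
public class MapTaskCount {

    // mapred.map.tasks = mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes
    static int mapTasks(int maxMapsPerTaskTracker, int taskTrackerNodes) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(maxMapsPerTaskTracker, taskTrackerNodes);
    }

    public static void main(String[] args) {
        int perNode = 6; // Job A's per-node limit from the question
        int nodes = 10;  // assumed cluster size, not from the original post
        System.out.println("mapred.map.tasks = " + mapTasks(perNode, nodes));
        // prints "mapred.map.tasks = 60"
    }
}
```

In the driver code, the result would be passed along the lines of conf.setInt("mapred.map.tasks", MapTaskCount.mapTasks(6, 10)) before the Job is created.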
